In This Issue 


This issue of Survey Methodology contains articles dealing with a variety of subjects. In the 
first article, Steel, Holt and Tranmer examine the problem of using aggregated data in studies on 
relationships at the individual or household level. They propose a simple general model that seeks 
to take account of the geographical effects of aggregation. They then describe how this model 
affects both the estimation of population means and covariance matrices and analysis at the regional 
level. In addition, by introducing auxiliary variables for which certain external sources provide 
an estimate of the covariance matrix at the unit level, the authors propose methods that provide 
an unbiased estimate of the parameters at the individual level, so as to avoid the effect of 
geographical aggregation. 

Binder gives a "cookbook" approach for deriving Taylor series approximations to the variances 
of a wide class of estimators from complex surveys. Several useful examples are presented, as well 
as new results on the application of this general technique to two-phase sampling. A justification 
of this method is given, showing the procedure to be consistent with the formulation given in earlier 
work by Binder and Patak. 

Yung and Rao suggest a linear approximation to the jackknife variance estimator. This linearized 
jackknife inherits the good statistical properties of the usual jackknife variance estimator but is 
computationally much less intensive. The specific form of the proposed variance estimator is 
developed for the generalized regression estimator of a total and for the ratio of two generalized 
regression estimators. In a simulation study using data from the U.S. Current Population Survey, 
they found that the jackknife, the linearized jackknife, and the usual linearization variance 
estimators worked quite well for poststratified estimates of a total, while an incorrect form of the 
jackknife was badly biased. 

Chaubey, Nebebe and Chen consider use of an Inverse Gaussian model for positively skewed 
data and develop corresponding model-assisted estimators for domain totals, which consist of 
Inverse Gaussian regression predictors together with an expansion estimator of the regression bias. 
A modified version of the estimator which gives reduced weight to the bias correction term, 
analogous to a modified regression estimator proposed by Särndal and Hidiroglou, is also proposed. 
In a simulation study using synthetic income data based on Statistics Canada's Survey of Household 
Income, Facilities and Finance, the proposed estimators are found to work reasonably well. 

Rizzo, Kalton and Brick investigate the use of auxiliary information in compensating for panel 
nonresponse through weight adjustment techniques. Using data from the Survey of Income and 
Program Participation (SIPP) to illustrate, they address two important issues, namely, the choice 
of auxiliary variables to be used in a nonresponse weight adjustment technique, and the choice 
of technique itself. A screening procedure in conjunction with logistic regression modelling is the 
means by which appropriate auxiliary variables are chosen. The nonresponse weighting adjustment 
methods considered are based on logistic regression models, categorical search algorithms and 
generalized raking. An empirical comparison of the various methods is discussed in detail. 

Ding and Fienberg develop models of matching error which can be used in estimation of total 
population from a probabilistic match of two or more samples. They develop their models for the 
particular application of a multiple sample census, that is, a census supplemented by auxiliary 
samples. They illustrate the usefulness of their methods by applying them in an analysis of the 1988 
St. Louis Dress Rehearsal Census data for which three samples were matched: the Census itself, 
the Post Enumeration Survey sample, and the Administrative List Supplement. 

In a paper on optimal stratification, Slanta and Krenzke discuss the use of the 
Lavallée-Hidiroglou method. This iterative method minimizes the sample size while fixing the 
coefficient of variation. In a practical illustration, the authors present the difficulties with the 
Lavallée-Hidiroglou method and show how they were resolved. 


Dagum proposes a new method for estimating underlying trends from seasonally adjusted data. 
The approach consists of two steps. The seasonally adjusted data are first extrapolated based on 
an ARIMA model. A 13-term Henderson filter is then applied to the extended series, using strict 
sigma limits for the identification and replacement of extreme values. The new method is compared 
to the standard method using data from several economic time series. It is found that the new 
method produces fewer unwanted ripples in the estimated trend, while identifying turning points 
just as quickly and requiring smaller revisions on average. 

Tillé proposes an algorithm that generalizes the selection-rejection method used for constructing 
a simple random sample without replacement. A specific case of this algorithm, which is called 
the "mobile stratification algorithm", is discussed. It serves to obtain a smoothed stratification 
effect by using as a stratification variable the serial number of the units of observation. This 
algorithm gets around the thorny problem of a continuous variable in strata. 

De Waal and Willenborg review recent research on statistical disclosure control for microdata 
files from the perspective of Statistics Netherlands. Models are developed for the probability that 
a particular record could be re-identified and for the probability that some record in a microdata 
file could be re-identified. Global recoding and local suppression are considered as methods to 
reduce disclosure risk. They conclude that there is still much need for further methodological 
research and development of efficient software. 

Finally, it is with sadness that I note the recent passing away of Maria Gonzalez, who died of 
cardiac arrest while vacationing in Puerto Rico this past February. Among her many contributions 
to the statistical community, for the past several years Maria has been an Associate Editor for the 
Survey Methodology journal. Her contribution in this capacity to the quality and breadth of this 
journal was very much appreciated, and she will be sorely missed. An obituary, written by Elizabeth 
and Fritz Scheuren, appeared in the April issue of Amstat News. 


The Editor 
Making Unit-Level Inferences From Aggregated Data 


D.G. STEEL, D. HOLT and M. TRANMER¹ 


ABSTRACT 


Data are often available only as a set of group or area means. However, it is well known that statistical analysis 
based on such data will often produce results very different from those obtained from analysing the corresponding 
individual or household data. If the results of area level analyses are thought to apply to the individual level then 
we risk committing the ecological fallacy. Aggregation or ecological effects arise in part because geographic areas 
are not comprised of random groupings of people or households but exhibit strong socio-economic differences 
between areas. The population structure must be incorporated into the statistical model underpinning the analysis 
if aggregation effects are to be understood. A simple general model is proposed to achieve this and the consequences 
of the model and its implications for the estimation of population means and covariance matrices are obtained. 
Furthermore, methods are suggested which can provide unbiased estimates of individual level parameters from 
aggregated data and so avoid the ecological fallacy. These methods rely on identifying the "grouping variables" 
that characterise the process that led to the population structure, or at least characterise the area differences. An 
estimate of the unit level covariance matrix of the grouping variables is required from some source. Data from the 
1991 Census of the United Kingdom have been analysed to identify the important grouping variables and evaluate 
the effectiveness of the proposed adjustment methods for the estimation of covariance matrices and correlation 
coefficients. These results lead to a suggested strategy for the analysis of aggregated data. 


KEY WORDS: Aggregation; Ecological fallacy; Grouping; Selection; Variance components. 


1. INTRODUCTION 


Researchers are often faced with the problem of wishing 
to investigate individual level relationships but having to 
make use of aggregated data, such as the means or totals 
for geographic areas. Ideally unit level data collected in 
a sample survey or census would be used, but may not be 
accessible because of confidentiality restrictions, or because 
the variables have not been collected in a recent survey or 
census. Administrative systems provide information on a 
range of variables, for example on unemployment, health, 
morbidity, but because of confidentiality requirements 
these data are usually made available for aggregates, such 
as geographic areas. The census also provides data for 
geographic areas. For these reasons, analysis of group level 
data is still an option used widely in social and epidem- 
iological research. 

Consider a population in which each individual has 
associated a vector of variables of interest, whose 
distribution has mean $\mu_y$ and covariance matrix $\Sigma_{yy}$. We are 
interested in relationships among the variables of interest 
as reflected by correlations, regression coefficients and 
principal components, which may all be derived from the 
covariance matrix, $\Sigma_{yy}$, which is our basic target of 
inference. For example, the variables of interest might 
include a set of attainment tests in an educational study; 
the incidence of a particular disease and a set of 
explanatory variables in an epidemiological study; or a set of 
deprivation measures in a sociological study. We suppose 
that individual level data are unavailable. However, the 
region may be subdivided into a set of small areas such as 
Census Enumeration Districts (EDs), and for each small 
area, $g$, or for a sample of areas, we observe the vector of 
average values $\bar{y}_g$ for the variables of interest together 
with the sample size $n_g$ on which this is based. 

The objective of the analysis, $\Sigma_{yy}$, is a covariance 
matrix which spans the small areas. The target of inference 
is not conditional on small area membership but refers to 
the marginal distribution across small areas. This contrasts 
with situations, such as small area estimation, in which the 
target of inference is in the conditional distribution given 
the small area. This is a separate, legitimate objective with 
which we are not concerned. The same models may be 
applicable, but the targets of inference are different. 
However, our formulation does allow for group specific 
variables to be included as variables of interest if required. 
For example, if we associate with each individual a set of 
ED means for the area in which the individual is located, 
then these can be included within the vector, y, of interest. 
In particular, regression analyses which include small area 
means as explanatory variables in the regression model can 
be encompassed by the approach. 

The literature associated with the analysis of aggregated 
data dates back to Gehlke and Biehl (1934) and includes 
significant contributions by Yule and Kendall (1950) and 
Robinson (1950), Blalock (1964), Openshaw and Taylor 
(1979) and more recently Arbia (1989). There are also 
problems associated with the fact that the areal units used 
often have no special significance, being constructed for 
reasons of cost, operational or administrative convenience. 
Moreover, the results of the group level analysis will 
depend on the scale of the units, that is their average size 
and the particular set of boundaries chosen. Several empir- 
ical studies have demonstrated these effects, including 
Clark and Avery (1976), Perle (1977), Openshaw (1984), 
and Fotheringham and Wong (1991). However, these 
studies have not provided any generally applicable theory 
or practical methods of modifying the results of group 
level analyses to provide reliable unit level inferences. 

Aggregation effects arise because geographic units are 
not comprised of random groupings of people. Individuals 
in the same area generally tend to be more alike because 
they choose to live in areas in a non-random way, or 
because they are subjected to common influences, or 
because they interact with one another. Thus there are 
socio-economic differences between areas which are 
confounded with the individual effects in any statistical 
analysis performed using aggregated data for the areas. 
A simple general model is proposed which seeks to incor- 
porate these effects. The consequences of this model and 
its implications for area level analysis are obtained. 
Furthermore, methods are suggested which provide, under 
certain circumstances, unbiased estimates of individual 
level parameters from aggregated level data and so avoid 
the ecological fallacy. These methods involve auxiliary 
variables for which a unit level sample covariance matrix 
is available from some source. This approach has been 
applied to data from the 1991 Census of the United 
Kingdom and a strategy developed for the analysis of 
aggregated data. 


2. MODELS FOR AREA EFFECTS 


We consider a population of $N$ individuals each having 
a vector $y$ of characteristics of interest. The population is 
comprised of $M$ groups and the random variable $c_i$ 
indicates the area to which the $i$-th population unit belongs. 
The number of individuals in the $g$-th area is $N_g$. 

We consider $\mu_y$ and $\Sigma_{yy}$ to be superpopulation 
parameters and the following statistical theory is obtained in 
this framework. However, we consider some survey design 
issues at the end of section 2. 

We assume that there exists a sample data set s of size 
n and that these individual data have been aggregated to 
provide a set of m area means which are available for anal- 
ysis. The following area level statistics can be calculated: 


the $g$-th area mean: 

$$\bar{y}_g = \frac{1}{n_g} \sum_{i \in g,s} y_i \tag{2.1}$$

the overall sample mean: 

$$\bar{y} = \frac{1}{n} \sum_{g \in s} n_g \bar{y}_g = \frac{1}{n} \sum_{i \in s} y_i \tag{2.2}$$

the area level sample covariance matrix: 

$$\bar{S}_{yy} = \frac{1}{m-1} \sum_{g \in s} n_g (\bar{y}_g - \bar{y})(\bar{y}_g - \bar{y})' \tag{2.3}$$

Analogous unit level statistics may be defined but 
will be unavailable to the analyst. For example 
$S_{yy} = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y})(y_i - \bar{y})'$ is the unit level sample 
covariance matrix. 
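
The statistics in (2.1) to (2.3) can be sketched numerically as follows. This is a minimal illustration only, assuming unit level data are held in a NumPy array with one row per individual and a parallel array of area labels; the function name `area_statistics` and its arguments are hypothetical and not part of the paper.

```python
import numpy as np

def area_statistics(y, groups):
    """Area means (2.1), overall mean (2.2), area level covariance (2.3)
    and, for comparison, the unit level sample covariance matrix.

    y      : (n, p) array, one row per sampled individual (assumed layout)
    groups : length-n array of area labels (assumed layout)
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups).ravel()
    labels, inverse = np.unique(groups, return_inverse=True)
    m, n = len(labels), len(y)

    n_g = np.bincount(inverse, minlength=m).astype(float)  # group sample sizes
    sums = np.zeros((m, y.shape[1]))
    np.add.at(sums, inverse, y)                            # per-area sums
    ybar_g = sums / n_g[:, None]                           # (2.1)
    ybar = (n_g @ ybar_g) / n                              # (2.2)

    dev = ybar_g - ybar
    S_bar = (dev * n_g[:, None]).T @ dev / (m - 1)         # (2.3), weighted by n_g

    unit_dev = y - ybar
    S_unit = unit_dev.T @ unit_dev / (n - 1)               # unit level S_yy
    return ybar_g, n_g, ybar, S_bar, S_unit
```

In the setting of the paper only `ybar_g` and `n_g` would be available to the analyst; `S_unit` is computed here purely so the two levels can be compared on synthetic data.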


2.1 Random Grouping 


While geographic groups are rarely formed randomly, 
such a situation is a useful starting point in considering 
ecological analysis. If groups are randomly formed then 
many group level analyses are valid, albeit with a reduced 
efficiency. Steel and Holt (1995) consider the properties 
of statistics such as means, variances, regression and 
correlation coefficients in this situation. When the groups 
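are randomly formed, i.e., $y \perp c$, then 

$$E[\bar{y}_g \mid s,c] = \mu_y \tag{2.4}$$

$$V(\bar{y}_g \mid s,c) = \frac{1}{n_g} \Sigma_{yy} \tag{2.5}$$

The basic properties of the unit and group level statistics 
then follow readily 

$$\mathrm{Cov}(\bar{y}_g, \bar{y}_h \mid s,c) = 0, \quad g \neq h \tag{2.6}$$

$$E[\bar{y} \mid s,c] = \mu_y \tag{2.7}$$

$$E[S_{yy} \mid s,c] = \Sigma_{yy} \tag{2.8}$$

$$E[\bar{S}_{yy} \mid s,c] = \Sigma_{yy} \tag{2.9}$$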


These properties apply if the sampling is ignorable given 
the group indicatives, which means the sample design can 
depend on the groups but not on y or any variable which 
is related to y conditional on c. For example a census or 
a simple random sample of groups and units within groups 
may be used. 

Unweighted group level statistics may be used by setting 
$n_g = 1$ in equations (2.2) and (2.3). This leads to 
inefficient estimators. The degree of inefficiency will depend on 
the distribution of the group sample sizes. Weighting by 
the group sample sizes is important and when this is done 
inference can proceed as usual with appropriate 
adjustments to the degrees of freedom. Variability is determined 
by the number of areas rather than the number of 
individual observations and confidence intervals and tests are 
adjusted accordingly. 


2.2 A Variance Component Model 


A simple way to represent the positive intra-group 
correlation that is usually observed in grouped populations 
is through a variance components model, which in the 
multivariate case corresponds to 


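$$y_i = \mu_y + \nu_g + \epsilon_i, \quad i \in g$$

where $\nu_g$ and $\epsilon_i$ are independent random components at 
the group and individual level respectively, both with zero 
expectation, $V(\epsilon_i \mid c) = \Sigma_{\epsilon\epsilon}$ and $V(\nu_g \mid c) = \Delta_{yy}$. 


Model A: 

$$E[y_i \mid c] = \mu_y \tag{2.10}$$

$$V(y_i \mid c) = \Sigma_{\epsilon\epsilon} + \Delta_{yy} = \Sigma_{yy} \tag{2.11}$$

$$\mathrm{Cov}(y_i, y_j \mid c) = \begin{cases} \Delta_{yy} & \text{if } c_i = c_j,\ i \neq j \\ 0 & \text{otherwise.} \end{cases} \tag{2.12}$$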


The notation $V(\cdot \mid c)$ implies the covariance matrix 
conditional on the group labels $c$ and hence determines 
common group membership. It is, however, taken to be 
unconditional over the group level random effects. Thus 
$V(y_i \mid c)$ contains the total variance from both the within 
group covariance matrix $\Sigma_{\epsilon\epsilon}$ and the group level covariance 
matrix $\Delta_{yy}$. 

The properties of the sample group level means follow 
readily from Model A, if the sampling is ignorable given c, 


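$$E[\bar{y}_g \mid s,c] = \mu_y \tag{2.13}$$

$$V(\bar{y}_g \mid s,c) = \frac{1}{n_g} \left( \Sigma_{yy} + (n_g - 1) \Delta_{yy} \right) \tag{2.14}$$

$$\mathrm{Cov}(\bar{y}_g, \bar{y}_h \mid s,c) = 0, \quad g \neq h. \tag{2.15}$$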
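
As a check on (2.14), the variance of a group mean can be computed by brute force from the covariance structure in (2.11) and (2.12). The sketch below is illustrative only: the function name `group_mean_variance` is hypothetical, and the inputs are assumed to be NumPy arrays holding $\Sigma_{yy}$ and the group level covariance matrix of Model A.

```python
import numpy as np

def group_mean_variance(sigma_yy, delta_yy, n_g):
    """V(ybar_g) for one group of n_g units under Model A, built directly
    from the stacked unit level covariance matrix."""
    p = sigma_yy.shape[0]
    # Stacked covariance of (y_1, ..., y_{n_g}): Sigma_yy on the diagonal
    # blocks (2.11) and Delta_yy on every off-diagonal block (2.12).
    V = (np.kron(np.eye(n_g), sigma_yy - delta_yy)
         + np.kron(np.ones((n_g, n_g)), delta_yy))
    # Averaging operator that maps the stacked vector to the group mean.
    A = np.kron(np.ones((1, n_g)) / n_g, np.eye(p))
    return A @ V @ A.T

sigma_yy = np.array([[2.0, 0.5], [0.5, 1.0]])
delta_yy = np.array([[0.4, 0.1], [0.1, 0.2]])
n_g = 5
closed_form = (sigma_yy + (n_g - 1) * delta_yy) / n_g   # right-hand side of (2.14)
brute_force = group_mean_variance(sigma_yy, delta_yy, n_g)
```

The two matrices agree, which shows where the $(n_g - 1)$ multiplier on the group level component comes from: each unit shares that component with the other $n_g - 1$ units in its group.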


The properties of the unit level and group level statistics 
are 


$$E[\bar{y} \mid s,c] = \mu_y \tag{2.16}$$

$$E[S_{yy} \mid s,c] = \Sigma_{yy} - \frac{\bar{n}_0 - 1}{n - 1} \Delta_{yy} \tag{2.17}$$

$$E[\bar{S}_{yy} \mid s,c] = \Sigma_{yy} + (\bar{n}^* - 1) \Delta_{yy} \tag{2.18}$$

where $\bar{n} = n/m$, $\bar{n}_0 = \bar{n}(1 + C^2)$, $\bar{n}^* = 
\bar{n}(1 - C^2/(m - 1))$ and $C^2 = \frac{1}{m} \sum_{g \in s} (n_g - \bar{n})^2 / \bar{n}^2$ 
is the square of the coefficient of variation of the group 
sample sizes in the sample. We note that the coefficient 
of $\Delta_{yy}$ is $O(m^{-1})$ in (2.17) but is $O(\bar{n})$ in (2.18). This 
illustrates how a small bias in the unit level analysis can 
be magnified into a much larger bias in the aggregate level 
analysis. We will discuss these results further in section 2.4. 
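
The contrast between the two coefficients in (2.17) and (2.18) is easy to see numerically. The following sketch computes them from a list of group sample sizes; the function name `aggregation_coefficients` and the returned keys are hypothetical, chosen only for this illustration.

```python
import numpy as np

def aggregation_coefficients(n_g):
    """Coefficients of the group level covariance component in (2.17) and (2.18)."""
    n_g = np.asarray(n_g, dtype=float)
    m, n = n_g.size, n_g.sum()
    n_bar = n / m
    c2 = np.mean((n_g - n_bar) ** 2) / n_bar ** 2   # squared CV of group sizes
    n_0 = n_bar * (1 + c2)
    n_star = n_bar * (1 - c2 / (m - 1))
    return {
        "n_bar": n_bar,
        "C2": c2,
        "n_0": n_0,
        "n_star": n_star,
        "unit_level": -(n_0 - 1) / (n - 1),   # coefficient in (2.17), O(1/m)
        "area_level": n_star - 1,             # coefficient in (2.18), O(n_bar)
    }

coef = aggregation_coefficients([10, 20, 30])
```

For three groups of sizes 10, 20 and 30 the unit level coefficient is about -0.38 while the area level coefficient is about 17.3, so the same group level component is weighted roughly 45 times more heavily in the aggregate analysis.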


2.3 Grouping Models 


In the discussion of ecological analysis, models have been 
proposed which take into account the group formation 
process. In this approach it is assumed that there is a grouping 
process which allocates individual units to groups according 
to a vector of grouping variables, $z_i$, either stochastically 
or deterministically. This approach is implicit in Blalock’s 
(1964) analysis and used explicitly by Hannan and Burstein 
(1974), Litchman (1974), Langbein and Litchman (1978), 
Smith (1977) and Blalock (1979, 1985). Steel (1985) refers 
to these models as grouping models since it is assumed that 
groups are formed by some process involving the variables 
in the relationships under study. The grouping is seen as 
a distorting effect and the relationships of interest are defined 
before the grouping has occurred. It is often noted in the 
discussion of contextual models that apparent contextual 
effects may in fact be due to such factors. The multivariate 
version of this model is: 


Model B: 

$$E[y_i \mid z,c] = \mu_{y.z} + B_{yz}' z_i \tag{2.19}$$

$$V(y_i \mid z,c) = \Sigma_{yy.z} \tag{2.20}$$

$$\mathrm{Cov}(y_i, y_j \mid z,c) = 0, \quad i \neq j. \tag{2.21}$$


In this model the conditional expectation of y; depends 
only on the value of the auxiliary variables for the i-th unit 
and is independent of the group to which the unit belongs 
or the values of the auxiliary variables of other units in the 
population. The conditional covariance between any two 
units is zero. This model covers grouping models in which 
the group formation process is characterised by the auxiliary 
variables $z_i$. The auxiliary variables can be thought 
of as those variables that determine to which group a unit 
belongs. More generally, the auxiliary variables can be 
regarded as the main individual level variables whose distribu- 
tions are not random across groups because of the choice 
or migration processes to which the population has been 
subjected. Contextual variables can also be included in this 
model as auxiliary variables which take the same value for 
each unit in the group. 
If the vector of auxiliary variables has a marginal 
distribution with mean $\mu_z$ and covariance matrix $\Sigma_{zz}$, then 
the marginal mean and covariance matrix of $y$ are given 
by $\mu_y = \mu_{y.z} + B_{yz}' \mu_z$ and $\Sigma_{yy} = \Sigma_{yy.z} + B_{yz}' \Sigma_{zz} B_{yz}$ 
respectively. The properties of the sample group level 
means follow readily from Model B: 

$$E[\bar{y}_g \mid s,z,c] = \mu_y + B_{yz}' (\bar{z}_g - \mu_z) \tag{2.22}$$

$$V(\bar{y}_g \mid s,z,c) = \frac{1}{n_g} \Sigma_{yy.z} \tag{2.23}$$

$$\mathrm{Cov}(\bar{y}_g, \bar{y}_h \mid s,z,c) = 0, \quad g \neq h. \tag{2.24}$$

The group level statistics then have the following 
properties 

$$E[\bar{y} \mid s,z,c] = \mu_y + B_{yz}' (\bar{z} - \mu_z) \tag{2.25}$$

$$E[S_{yy} \mid s,z,c] = \Sigma_{yy} + B_{yz}' (S_{zz} - \Sigma_{zz}) B_{yz} \tag{2.26}$$

$$E[\bar{S}_{yy} \mid s,z,c] = \Sigma_{yy} + B_{yz}' (\bar{S}_{zz} - \Sigma_{zz}) B_{yz} \tag{2.27}$$

where $S_{zz}$ and $\bar{S}_{zz}$ are defined analogously to $S_{yy}$ and $\bar{S}_{yy}$ 
as given in equation (2.3) and the sentence that follows it. 


2.4 A Combined Model 


The two models considered so far can be thought of as 
competing explanations of the group effects, but they can 
be combined into a more realistic model which contains 
both grouping effects and residual variance components: 


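Model C: 

$$E[y_i \mid z,c] = \mu_{y.z} + B_{yz}' z_i \tag{2.28}$$

$$V(y_i \mid z,c) = \Sigma_{yy.z} \tag{2.29}$$

$$\mathrm{Cov}(y_i, y_j \mid z,c) = \begin{cases} \Delta_{yy.z} & \text{if } c_i = c_j,\ i \neq j \\ 0 & \text{otherwise.} \end{cases} \tag{2.30}$$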


This model allows for group formation processes which 
are characterised by the auxiliary variables $z_i$. It also 
includes residual within group correlations which reflect 
random effects which are interpreted as due to unobserved 
random group level variables after allowing for the 
grouping variables. 

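The properties of the sample group level means follow 
from Model C, if the sampling is ignorable given $(z,c)$, 

$$E[\bar{y}_g \mid s,z,c] = \mu_y + B_{yz}' (\bar{z}_g - \mu_z) \tag{2.31}$$

$$V(\bar{y}_g \mid s,z,c) = \frac{1}{n_g} \left( \Sigma_{yy.z} + (n_g - 1) \Delta_{yy.z} \right) \tag{2.32}$$

$$\mathrm{Cov}(\bar{y}_g, \bar{y}_h \mid s,z,c) = 0, \quad g \neq h \tag{2.33}$$

and 

$$E[\bar{y} \mid s,z,c] = \mu_y + B_{yz}' (\bar{z} - \mu_z) \tag{2.34}$$

$$E[S_{yy} \mid s,z,c] = \Sigma_{yy} + B_{yz}' (S_{zz} - \Sigma_{zz}) B_{yz} - \frac{\bar{n}_0 - 1}{n - 1} \Delta_{yy.z} \tag{2.35}$$

$$E[\bar{S}_{yy} \mid s,z,c] = \Sigma_{yy} + B_{yz}' (\bar{S}_{zz} - \Sigma_{zz}) B_{yz} + (\bar{n}^* - 1) \Delta_{yy.z}. \tag{2.36}$$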


Equations (2.17) and (2.18) showed how the effect of 
aggregation in the variance components model, A, 
amplifies the contribution of the random group level effects. In 
equation (2.17) the coefficient of $\Delta_{yy}$ is $O(m^{-1})$ whereas 
in (2.18) it is $O(\bar{n})$. For the combined model, C, equations 
(2.35) and (2.36) show how inclusion of the grouping 
variables permits the partition of the bias into two additive 
terms: the first related to the grouping variables, their 
relationship to the variables of interest and their aggregation 
effect, and the second involving $\Delta_{yy.z}$, the residual 
components of variance after controlling for the grouping 
variables. Note that the coefficients of $\Delta_{yy.z}$ in equations 
(2.35) and (2.36) are still $O(m^{-1})$ and $O(\bar{n})$ respectively, 
as they were in equations (2.17) and (2.18), but the residual 
components of variance should in general be smaller. The 
basic assumption in (2.29) is that the residual variance is 
constant across $c$. 

The assumption that the sampling is ignorable given 
(z,c) means that the sample design can depend on the aux- 
iliary variables and the group indicatives. This allows, for 
example, the use of stratification based on the values of 
z and cluster or multi-stage sampling based on the groups. 

The weighted group level matrix $\bar{S}_{yy}$ is intended to 
estimate $\Sigma_{yy}$. The first bias term in (2.36) is due to the 
effect of the grouping variables and will be zero if $B_{yz} = 0$ 
or approximately so if $\bar{S}_{zz} \approx \Sigma_{zz}$. The condition $B_{yz} = 0$ 
is a strong condition and implies that the variables of 
interest are unrelated to the grouping variables. The effect 
of aggregation on the sample covariance of any two 
variables will depend on the relationships of the variables 
with the grouping variables $z_i$ and we would expect the 
aggregation effects to be greater for variables more closely 
related to the grouping variables. The condition $\bar{S}_{zz} \approx \Sigma_{zz}$ 
implies that there are no selection or aggregation effects 
for the $z$ variables. These conditions are unlikely to apply 
in practice and hence bias will result for many variables. 
The bias due to the sampling and grouping involving the 
auxiliary variables is determined by $S_{zz} - \Sigma_{zz}$ for the unit 
level estimator and by $\bar{S}_{zz} - \Sigma_{zz}$ for the group level 
estimator. The term $\bar{S}_{zz} - \Sigma_{zz}$ reflects the net effect of the 
sampling and aggregation on the auxiliary variables. 

The second bias term in (2.36) will be zero if $\Delta_{yy.z} = 0$ 
which implies that, conditional on the grouping variables, 
there is no residual intra-group correlation among the y 
variables. This is unlikely to occur in practice but it is 
desirable to identify grouping variables that account for 
as much of the aggregation effects as possible by making 
this residual term as small as possible. 

The effects due to the grouping and sampling depending 
on z and the effect due to the residual within group corre- 
lation are additive; this will be the case for more complex 
forms of within group correlations provided the linearity 
of the model holds. If z follows a simple variance compo- 
nent model, like Model A then 


$$E[\bar{S}_{zz} \mid s,c] = \Sigma_{zz} + (\bar{n}^* - 1) \Delta_{zz}$$

$$E[\bar{S}_{yy} \mid s,c] = \Sigma_{yy} + (\bar{n}^* - 1) B_{yz}' \Delta_{zz} B_{yz} + (\bar{n}^* - 1) \Delta_{yy.z} \tag{2.37}$$

and the intra-group covariances of the variables of interest 
are composed of a component due to the intra-group 
covariances of the auxiliary variables and the residual 
components. The right hand side of (2.37) represents a 
partition of (2.18) since if $z$ follows a variance components 
model then so does $y$ unconditionally. The motivation 
behind the basic model is to find auxiliary variables so that 
the residual or conditional within group covariances $\Delta_{yy.z}$ 
are small or, ideally, disappear. 
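
The partition in (2.37) can be written out as a small calculation. The sketch below is illustrative only: the function name `expected_area_covariance` and its argument names are hypothetical, and all inputs are assumed to be known population quantities supplied as NumPy arrays, which would not be the case in practice.

```python
import numpy as np

def expected_area_covariance(sigma_yy, B_yz, delta_zz, delta_yy_z, n_star):
    """Expected area level covariance matrix under (2.37), split into the
    part acting through the grouping variables z and the residual part."""
    grouping_part = (n_star - 1) * B_yz.T @ delta_zz @ B_yz
    residual_part = (n_star - 1) * delta_yy_z
    total = sigma_yy + grouping_part + residual_part
    return total, grouping_part, residual_part

sigma_yy = np.array([[3.0, 1.0], [1.0, 2.0]])
B_yz = np.array([[1.0, 0.5], [-1.0, 2.0]])      # rows index z, columns index y
delta_zz = np.array([[0.3, 0.1], [0.1, 0.2]])
delta_yy_z = np.array([[0.02, 0.0], [0.0, 0.01]])
total, grouping_part, residual_part = expected_area_covariance(
    sigma_yy, B_yz, delta_zz, delta_yy_z, n_star=18.0
)
```

With values like these the grouping part dominates the residual part, which is the situation the adjustment in the next section is designed for.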


2.5 Adjusting for Aggregation Effects 


Few useful proposals have been made on how to adjust 
the area level analyses to produce reasonable estimates of 
the unit level relationship. Duncan and Davis (1953) 
considered the possible range of the correlation coefficient 
calculated from a 2 by 2 table with known margins. The 
resulting bounds are often too wide to be of practical use. 
Goodman (1959) identified specific conditions for a regres- 
sion model under which ecological analysis could validly 
be used to draw inferences regarding relationships at the 
individual level. Langbein and Litchman (1978) consider 
some methods that can be applied when grouping is by the 
dependent variable and unit level variances are available 
for both the dependent and all the independent variables 
in the regression model. However, none of these approaches 
provides a general approach to the problem. 

Examining the bias for $\bar{S}_{yy}$ given in (2.36) shows that 
if we add $B_{yz}' (\Sigma_{zz} - \bar{S}_{zz}) B_{yz}$ to $\bar{S}_{yy}$ the bias term due to 
the grouping variables would be removed. Now (2.31) 
implies that 

$$E[\hat{B}_{yz} \mid s,z,c] = B_{yz} \tag{2.38}$$

where $\hat{B}_{yz} = \bar{S}_{zz}^{-1} \bar{S}_{zy}$. 


If the covariance matrix of $z$, $S_{zzs_0}$, from a unit level 
sample $s_0$ drawn from $m_0$ groups was available then the 
adjusted estimator 

$$\hat{\Sigma}_{yy}(z) = \bar{S}_{yy} + \hat{B}_{yz}' (S_{zzs_0} - \bar{S}_{zz}) \hat{B}_{yz} \tag{2.39}$$

should remove the aggregation bias due to the grouping 
variables $z$, provided $S_{zzs_0}$ is close to $\Sigma_{zz}$. The source for 
$S_{zzs_0}$ may be quite independent of the data used in $\bar{S}_{yy}$ 
and $\hat{B}_{yz}$. Steel (1985) shows that the adjusted estimator 
(2.39) can be obtained as the MLE of $\Sigma_{yy}$ (with the usual 
replacement of $m - 1$ by $m$ etc.) if normality of the 
distribution of $(y,z)$ applies, $s_0$ is a simple random sample 
from the population and $\Delta_{yy.z} = 0$. The adjusted 
estimator corresponds to the Pearson (1903) adjustment 
considered by Holt, Smith and Winter (1980) in the case 
of regression analysis and Smith and Holmes (1989) in the 
case of multivariate analysis. In these cases the adjustment 
is applied to statistics calculated from unit level data 
obtained from a sample whose design depends on the 
auxiliary variables. In our case the adjustment is applied 
to statistics calculated from area means and the auxiliary 
variables used in the adjustment include grouping variables 
as well as any design variables. The adjusted estimator 
of $\mu_y$ is 

$$\hat{\mu}_y(z) = \bar{y} + \hat{B}_{yz}' (\bar{z}_{s_0} - \bar{z}) \tag{2.40}$$

where $\bar{z}_{s_0}$ is the mean calculated from $s_0$. 
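
The adjusted estimators (2.39) and (2.40) amount to a few matrix operations once the area level statistics and the external unit level statistics for $z$ are in hand. The sketch below assumes these are supplied as NumPy arrays; the function name `adjusted_estimators` and its argument names are hypothetical.

```python
import numpy as np

def adjusted_estimators(ybar, zbar, S_yy_bar, S_zz_bar, S_zy_bar, zbar_s0, S_zz_s0):
    """Adjusted mean (2.40) and covariance matrix (2.39).

    ybar, zbar         : overall means of y and z from the aggregated data
    S_yy_bar, S_zz_bar : area level covariance matrices as in (2.3)
    S_zy_bar           : area level cross-covariance of z (rows) with y (columns)
    zbar_s0, S_zz_s0   : unit level mean and covariance of z from the source s0
    """
    # Area level regression coefficients as in (2.38); solve() avoids an explicit inverse.
    B_hat = np.linalg.solve(S_zz_bar, S_zy_bar)
    sigma_adj = S_yy_bar + B_hat.T @ (S_zz_s0 - S_zz_bar) @ B_hat   # (2.39)
    mu_adj = ybar + B_hat.T @ (zbar_s0 - zbar)                      # (2.40)
    return mu_adj, sigma_adj, B_hat
```

When `S_zz_s0` equals `S_zz_bar` the covariance adjustment vanishes and the area level matrix is returned unchanged, so the size of the correction depends entirely on how far the aggregated covariance of the grouping variables departs from its unit level counterpart.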
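From (2.34) and (2.38) we see that 

$$E[\hat{\mu}_y(z) \mid s,z,s_0,c] = \mu_y + B_{yz}' (\bar{z}_{s_0} - \mu_z). \tag{2.41}$$

Moreover, Steel (1985) shows that (2.36) and (2.38) 
imply 

$$E[\hat{\Sigma}_{yy}(z) \mid s,z,s_0,c] = \Sigma_{yy} + B_{yz}' (S_{zzs_0} - \Sigma_{zz}) B_{yz} + (\bar{n}^* - 1) \Delta_{yy.z} + O(m^{-1}) \tag{2.42}$$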
provided $\mathrm{tr}(\bar{S}_{zz}^{-1} S_{zzs_0})$ and $\bar{n}\,\mathrm{tr}((\bar{S}_{zz}^{-1} S_{zzs_0} - I) \bar{S}_{zz}^{-1} \bar{S}_{zz}^{(2)})$ 
are bounded, where $\bar{S}_{zz}^{(2)}$ is defined similarly to $\bar{S}_{zz}$ with 
$n_g$ replaced by $n_g^2/\bar{n}$. 

Comparing (2.42) with (2.35) we see that the component 
of bias due to the grouping variables has been adjusted 
to that associated with the use of $S_{zzs_0}$, if it had been 
available. The estimator adjusts for the aggregation effects 
that have acted through $z$. It also adjusts the effect of the 
sampling design from that associated with $s$ to that 
associated with $s_0$. 

Suppose that the sampling design used to generate $s_0$ 
and the values of the auxiliary variables are generated from 
a superpopulation such that 

$$E[\bar{z}_{s_0} \mid s_0,c] = \mu_z + O(m_0^{-1}) \tag{2.43}$$

$$E[S_{zzs_0} \mid s_0,c] = \Sigma_{zz} + O(m_0^{-1}) \tag{2.44}$$

where $m_0$ is the number of groups in $s_0$. 
In such cases 

$$E[\hat{\mu}_y(z) \mid s,s_0,c] = \mu_y + O(m_0^{-1}) \tag{2.45}$$

$$E[\hat{\Sigma}_{yy}(z) \mid s,s_0,c] = \Sigma_{yy} + (\bar{n}^* - 1) \Delta_{yy.z} + O(\tilde{m}^{-1}) \tag{2.46}$$

where $\tilde{m} = \min(m, m_0)$. 


Conditions (2.43) and (2.44) would apply if the 
population $z$ values across groups arose from a variance 
component model similar to Model A and the sampling design 
for $s_0$ depended only on the grouping but not any auxiliary 
variables. Sampling designs such as simple random sampling 
or equal probability cluster or multi-stage sampling fulfil 
this condition. Use of census data, so that $s_0$ is the entire 
finite population, is also applicable. 

It is thus possible to adjust for the bias due to the 
grouping variables provided some unit level sample co- 
variance matrix for z is available. The motivation for the 
approach is a situation where the predominant group 
effects can be attributed to selectivity or grouping effects 
acting through the grouping variables. The adjustment for 
the auxiliary variables removes the effect of the apparent 
intra-group correlation due to these variables. The adjusted 
estimator still has a component of bias due to $\Delta_{yy.z}$ and if 
$z$ is not effective in significantly reducing the intra-group 
correlations then this term can still be important. This 
approach therefore relies on choice of appropriate aux- 
iliary variables to reduce the intra-group correlations. 

If the sampling design for $s_0$ and the superpopulation 
model for $z$ are such that (2.43) and (2.44) do not apply 
then $\bar{z}_{s_0}$ and $S_{zzs_0}$ can be replaced by estimators $\hat{\mu}_{zs_0}$ and 
$\hat{\Sigma}_{zzs_0}$ in the calculation of the adjusted estimators $\hat{\mu}_y(z)$ 
and $\hat{\Sigma}_{yy}(z)$. The resulting expectations of the adjusted 
estimators are given by (2.41) and (2.42) with $\bar{z}_{s_0}$ replaced 
by $\hat{\mu}_{zs_0}$ and $S_{zzs_0}$ replaced by $\hat{\Sigma}_{zzs_0}$. There are a number of 
choices available for the estimators $\hat{\mu}_{zs_0}$ and $\hat{\Sigma}_{zzs_0}$ 
calculated from the sample $s_0$. Smith and Holmes (1989) 
consider a range of model based and design based estimators 
that can be used. For example suppose the sample design 
used to obtain $s_0$ involved stratification according to the 
values of the vector of size variables $x$. Denote the sample 
inclusion probability for population unit $i$ as $\Pi_i$; the 
associated probability based weight is $w_i = \Pi_i^{-1}$. 
The probability weighted estimator of p, is Z,* = Lies, Wi 
z, and of Lz, 18 Szos) = Liesy Wizi%/ — Wo | Zot Ze where 
Wo = Lieso Wi- 

The Pearson based adjusted estimators of $\mu_z$ and $\Sigma_{zz}$ 
are $\bar{z}_{s_0} + \hat{B}_{zxs_0}' (\bar{x}_U - \bar{x}_{s_0})$ and $S_{zzs_0} + \hat{B}_{zxs_0}' (S_{xxU} - S_{xxs_0}) \hat{B}_{zxs_0}$ 
respectively. Here $\bar{x}_U$ and $S_{xxU}$ are the mean vector 
and covariance matrix of the design variables in $x$ in the 
finite population and $\hat{B}_{zxs_0} = S_{xxs_0}^{-1} S_{xzs_0}$. 

Pobability weighted Pearson based adjusted estimates 
may also be considered, /.e., 23, + Bis, %, — X%,) and 
Sts a Brxso (Srxu iin Skxs9) Bays: 

Here x3, and S¢,,, are defined analogously to Z;, and 
Stz5, respectively and Bs,,, = Syxs¢ Stzs,- The approach 
taken so far is strongly model based and so model based 
estimators of », and L,, would be preferred. However, in 
practice the data available for use in the adjustment may 
comprise published p-weighted estimators of means and 
covariances obtained from the sample so, which is inde- 
pendent of s. Thus 


$$E_p[\hat\mu_{zs_0} \mid z, c] = \bar z_u$$
$$E_p[\hat\Sigma_{zzs_0} \mid z, c] = S_{zzu}$$


where $\bar z_u$ and $S_{zzu}$ are the mean vector and covariance
matrix of the auxiliary variables in the finite population
and $E_p$ represents the expectation with respect to repeated
sampling using the sampling design employed to obtain
$s_0$, i.e., the randomization distribution. Thus from (2.41)
and (2.42)


$$E[\hat\mu_y(z) \mid s, z, c] = \mu_y + B_{yz}'(\bar z_u - \mu_z)$$
$$E[\hat\Sigma_{yy}(z) \mid s, z, c] = \Sigma_{yy} + B_{yz}'(S_{zzu} - \Sigma_{zz})B_{yz} + (\bar n^* - 1)\Delta_{yy.z} + O(m^{-1}).$$


These expectations are taken over the statistical model
generating the y values and the randomization distribution
associated with $s_0$. In practice $\bar z_u$ and $S_{zzu}$ will be very
close to $\mu_z$ and $\Sigma_{zz}$ respectively.
3. IDENTIFYING GROUPING VARIABLES


In the previous section we introduced a set of auxiliary 
variables, z, which characterised the area differences and 
which were used to adjust the aggregated analysis to reduce 
the aggregation bias. If the auxiliary variables were totally
successful then $\Delta_{yy.z}$ would be reduced to zero and the
adjustment method would remove the aggregation bias
completely. In practice the auxiliary variables for which
$\Delta_{yy.z} = 0$ are unknown. Also we will be restricted to sets
of variables for which area level means are available as part
of the data set under analysis and for which an estimate
$\hat\Sigma_{zz}$ of the unit level covariance matrix is available. Basic
demographic information and housing variables commonly 
available from the Census may be used. However these 
variables may not fully characterise the grouping process 
and so they may not explain as much of the between area 
difference as we might wish. 


3.1 An Analysis Strategy 


In practice the grouping variables will not be known. 
We need a strategy for identifying adjustment variables 
for which an estimate of the unit level covariance matrix 
is available and which account for group effects. One 
strategy involves the following steps: 


1) Identify a set of variables that cover the same subject 
area as the variables of interest, but for which both area 
level and unit level data are available for some period 
in the past. Previous Census data may be suitable. 


2) Add to this set, variables (such as demographic and 
housing variables) which are candidate z variables since 
they are known to be strongly associated with area 
differences. Estimates of both the area level and unit 
level covariance matrices must also be available for the 
same period in the past. 


3) Carry out an analysis of these data to identify the 
variables which account most strongly for the area level 
effects among the variables of interest. This analysis, 
which we term a CGV analysis, will be described below. 


4) Identify from (3) a set of adjustment variables which 
are available within the current data set and for which 
the current unit level covariance matrix is available 
from some source. 


5) For some variables of interest it may be possible to
obtain estimates of unit level variances or covariances,
from published tables for example. From these calculate
aggregation effects $\theta_a = \bar S_{aa}/S_{aa}$ or $\theta_{ab} = \bar S_{ab}/S_{ab}$.


6) Use the variables identified in (4) to adjust the aggregate 
analysis for the variables of interest and check the 
adjusted aggregation effects corresponding to (5) to 
monitor the success of the adjustment. 


3.2 The Ideal Grouping Variables 


We first consider the ideal set of grouping variables that 
could be used for adjustment so as to identify the appro- 
priate (CGV) analysis that could be followed for the 
analysis of aggregated data using the strategy outlined 
above. 

Let us suppose that for the complete set of variables of
interest we have the area level variance-covariance matrix
$\bar S_{yy}$ and the unit level variance-covariance matrix $S_{yys_1}$,
based on a sample $s_1$. Of course if this occurred in practice
the aggregation problem would disappear since we could
discard $\bar S_{yy}$ and simply use $S_{yys_1}$ as an estimate of $\Sigma_{yy}$.
However there are three reasons for considering this
situation. Firstly it helps to throw light on the grouping
structure which determines the relationship between $\bar S_{yy}$
and $S_{yys_1}$. Secondly it may be that $\bar S_{yy}$ and $S_{yys_1}$ are
available at some point in time such as census day but that
further analysis of a new version of $\bar S_{yy}$ is to be based on
inter-censal data when $S_{yys_1}$ is unavailable. If the grouping
structure persists over time, as we might expect, then the
analysis of the census day versions of $\bar S_{yy}$ and $S_{yys_1}$ might
help the subsequent inter-censal analysis by identifying the
key variables that explain a large proportion of the
aggregation effects. These possibilities underpin the strategy
outlined in section 3.1 above. Thirdly if the variables in
y cover a large range of socio-economic and demographic
variables, as occurs in the census, then the key variables
that account for the grouping effects for the variables may
also explain much of the grouping effects of other
socio-economic and demographic variables. Note that the two
samples s and $s_1$ may be identical but in general do not
need to be. For example s may correspond to an administrative
source which is effectively a census that provides
aggregate data for geographic areas, and $s_1$ is a sample
survey from which individual level data are made available
without any geographic identifiers.

To help identify the important variables associated with
the grouping, Steel (1985) suggests that $\hat\theta_1, \ldots, \hat\theta_p$, the
eigenvalues of $S_{yys_1}^{-1} \bar S_{yy}$, be calculated as well as the
matrix $\hat D_y = [\hat d_1, \ldots, \hat d_p]$ such that

$$\hat D_y' \bar S_{yy} \hat D_y = \mathrm{diag}(\hat\theta_k) \quad \text{and} \quad \hat D_y' S_{yys_1} \hat D_y = I.$$

The variables defined by the transformation

$$\hat u_{ki} = \hat d_k' y_i$$

successively have maximum ratio of between group to
sample total variance and have zero sample correlation at
the unit and group level and unit level sample variance of 1.
These variables are called the sample Canonical Grouping
Variables (CGVs). The sample CGVs have the maximum
intra-group correlation. Note that
$\mathrm{tr}(S_{yys_1}^{-1} \bar S_{yy}) = \sum_k \hat\theta_k$
can be defined as the multivariate aggregation effect.
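
The construction of the sample CGVs can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function name `canonical_grouping_variables` and the synthetic matrices `S_bar` (group level) and `S_unit` (unit level) are assumptions introduced here. It solves the generalized eigenproblem through a Cholesky factor of the unit level matrix, so that the two normalizations stated above hold by construction.

```python
import numpy as np

def canonical_grouping_variables(S_bar, S_unit):
    """Return (theta, D) with D' S_bar D = diag(theta) and D' S_unit D = I."""
    L = np.linalg.cholesky(S_unit)        # S_unit = L L'
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_bar @ L_inv.T           # symmetric; same eigenvalues as S_unit^{-1} S_bar
    theta, V = np.linalg.eigh(M)          # eigenvalues in ascending order
    order = np.argsort(theta)[::-1]       # largest aggregation effect first
    theta, V = theta[order], V[:, order]
    D = L_inv.T @ V                       # columns are the coefficient vectors d_k
    return theta, D

# Synthetic (hypothetical) covariance matrices for p = 4 variables.
rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p))
S_unit = A @ A.T + p * np.eye(p)          # positive definite unit level matrix
C = rng.normal(size=(p, p))
S_bar = S_unit + 5.0 * C @ C.T            # group level matrix

theta, D = canonical_grouping_variables(S_bar, S_unit)
# Multivariate aggregation effect: trace of S_unit^{-1} S_bar.
multivariate_effect = np.trace(np.linalg.solve(S_unit, S_bar))
```

In this sketch the sum of `theta` equals `multivariate_effect`, which is the trace identity noted in the text.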
Note that the matrix $\hat D_y$ will exist even if $S_{yys_1}$ and $\bar S_{yy}$
are based on different samples so long as the former is
positive definite and the latter is positive semi-definite.
Furthermore the variances of the CGVs will be non-negative.
However, when s and $s_1$ are distinct it is possible
that the maximum variance of a CGV could exceed
(N - 1)/(M - 1), which is the maximum possible aggregation
effect. In this case the CGV has an implied negative
within group variance component. For our purposes this
may not matter since we are interested in identifying
important grouping variables but in principle the offending
variance of the CGV could be set to its theoretical maximum.
The sample CGVs are obtained from the eigenvectors
of $\hat A_{yy} = S_{yys_1}^{-1} \bar S_{yy}$. If s and $s_1$ are the same sample
then $\hat A_{yy}$ is the sample regression coefficient for the
regression of the group level means on the unit level values
calculated over the unit level sample. In this case the
sample CGVs are in fact the sample canonical variates
relating the unit level and group level data and $\hat\theta_k$ are the
sample canonical correlations.

Having calculated the CGVs the difference between the
sample group level and unit level covariance matrix can
be expressed as

$$\bar S_{yy} - S_{yys_1} = \sum_k (\hat\theta_k - 1)\, \hat b_k \hat b_k'$$

where $\hat b_k$ is the vector of sample covariances between the
k-th CGV and the original variables. Hence the difference
between the group level and unit level covariance matrix
can be partitioned into p orthogonal elements, one for
each CGV.

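For the covariance between $y_a$ and $y_b$, the difference
between the sample group level covariance, $\bar s_{ab}$, and unit
level covariance $s_{ab}$ (where $\bar s_{ab}$ and $s_{ab}$ are elements of
$\bar S_{yy}$ and $S_{yys_1}$, respectively) is

$$\bar s_{ab} - s_{ab} = \sum_k (\hat\theta_k - 1)\, \hat b_{ak} \hat b_{bk} = \sqrt{s_{aa} s_{bb}} \sum_k (\hat\theta_k - 1)\, \hat\rho_{ak} \hat\rho_{bk}$$

where $\hat\rho_{ak} = \hat b_{ak}/s_{aa}^{1/2}$ is the sample correlation between
the a-th variable and the k-th sample CGV.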

If the first q sample CGVs are used to calculate an
adjusted group level variance matrix, i.e., $\hat u_{qi} = \hat D_q' y_i$,
where $\hat D_q = [\hat d_1, \ldots, \hat d_q]$, are used as the auxiliary
variables,

$$\hat\Sigma_{yy}(\hat u_q) = \bar S_{yy} + \hat B_q' (S_{u_q u_q s_0} - \bar S_{u_q u_q}) \hat B_q,$$

then the first q terms of the decomposition are removed,
i.e.,

$$\hat\Sigma_{yy}(\hat u_q) = S_{yys_1} + \sum_{k=q+1}^{p} (\hat\theta_k - 1)\, \hat b_k \hat b_k'$$

and $\mathrm{tr}(S_{yys_1}^{-1} \hat\Sigma_{yy}(\hat u_q)) = q + \sum_{k=q+1}^{p} \hat\theta_k$.
In fact use of the first q CGVs provides the matrix of rank q
that minimizes $\| S_{yys_1} - \hat\Sigma_{yy}(\hat u_q) \|$. Hence by
examining the quantities

$$\sum_{k=q+1}^{p} \hat\theta_k \quad \text{and} \quad 1 + \sum_{k=q+1}^{p} (\hat\theta_k - 1)\, \hat\rho_{ak}^2, \quad a = 1, \ldots, p,$$

it is possible to examine how the proportion of the overall
aggregation effect and the aggregation effect for each
variable can be explained by the first q sample CGVs.

The preceding analysis will suggest how many dimensions
are required to effectively explain and hence remove
a specified amount of the aggregation effects. Moreover
by looking at the loadings of the original variables in the
CGVs, it should be possible to identify which variables
play the major role in "explaining" the aggregation effects
of the other variables. It is these variables that researchers
should concentrate on obtaining unit level data for, to use
in the adjusted estimator.
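
The effect of adjusting by the first q CGVs can be checked numerically. The sketch below is illustrative only: the helper names `cgv_decomposition` and `adjust_by_first_q` and the synthetic matrices are assumptions made here, and the adjustment is applied directly by subtracting the first q terms of the decomposition rather than through the regression form of the adjusted estimator.

```python
import numpy as np

def cgv_decomposition(S_bar, S_unit):
    """Eigenvalues theta, coefficients D and covariance vectors B = S_unit D."""
    L = np.linalg.cholesky(S_unit)
    L_inv = np.linalg.inv(L)
    theta, V = np.linalg.eigh(L_inv @ S_bar @ L_inv.T)
    order = np.argsort(theta)[::-1]
    theta, V = theta[order], V[:, order]
    D = L_inv.T @ V
    B = S_unit @ D                        # column k: covariances of y with the k-th CGV
    return theta, D, B

def adjust_by_first_q(S_bar, S_unit, q):
    """Remove the first q terms (theta_k - 1) b_k b_k' from S_bar."""
    theta, _, B = cgv_decomposition(S_bar, S_unit)
    removed = (B[:, :q] * (theta[:q] - 1.0)) @ B[:, :q].T
    return S_bar - removed, theta, B

# Synthetic (hypothetical) matrices for p = 4 variables.
rng = np.random.default_rng(1)
p, q = 4, 2
A = rng.normal(size=(p, p))
S_unit = A @ A.T + p * np.eye(p)
C = rng.normal(size=(p, p))
S_bar = S_unit + 5.0 * C @ C.T

Sigma_q, theta, B = adjust_by_first_q(S_bar, S_unit, q)
residual_effect = np.trace(np.linalg.solve(S_unit, Sigma_q))
# Squared correlations between each variable and each CGV (CGVs have unit variance).
rho_sq = B**2 / np.diag(S_unit)[:, None]
per_variable_effect = 1.0 + rho_sq[:, q:] @ (theta[q:] - 1.0)
```

Here `per_variable_effect` is the ratio of adjusted to unit level variance for each variable, which is what remains of each aggregation effect after the first q CGVs are removed.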

These results have some important implications for the 
use of group level data supplemented by limited unit level 
data, since they open the way to combining sample survey 
data and group level data from one or more sources and 
suggest a strategy for the analysis of group effects and 
group level data. 


4. SOME EMPIRICAL RESULTS 


We illustrate the ideas of the previous sections with an
analysis of the 1991 UK population census data for the
Local Authority District (LAD) of Reigate, Banstead and
Tandridge. The LAD population is 188,700 people contained
in 371 EDs giving an average number of people per
ED of $\bar n$ = 508.6. Group level data are available on a
complete count basis for each ED in the LAD from the
Small Area Statistics (SAS) data file. Corresponding unit
level data for the LAD are obtained from a 2 per cent
Sample of Anonymized Records of individuals (SAR). The
records in the SAR cannot be identified with any specific
ED within the LAD; thus in this situation we have $\bar S_{yy}$
based upon complete data for each ED from the SAS and
we have an estimate of $S_{yys_1}$ based on a 2 per cent sample
from the SAR. The following analysis is based upon
16 census variables for each person.

For each variable the group level data and the unit
level data were used to calculate the aggregation effect,
$\hat\theta_a = \bar S_{aa}/S_{aas_1}$. The parameter $\delta_{aa} = \Delta_{aa}/\Sigma_{aa}$, defined
on the appropriate diagonal elements of $\Delta_{yy}$ and $\Sigma_{yy}$, is
the intra-group correlation for the a-th variable. An estimate
$\hat\delta_{aa}$ of the intra-group correlation can be obtained
from (2.18) since $\theta_a = 1 + (\bar n^* - 1)\delta_{aa}$. The results
for the variables are given in Table 1. The intra-group
Table 1

Aggregation Effects and Intra-class Correlations for
Census Variables in Reigate LAD

                                       Aggregation   Intra-class
                                       Effect        Correlation

Persons aged 18-29                         9.20         .016
Persons aged 30-44                         4.56         .007
Persons aged 45-59*                        5.97         .010
Persons aged 60 and over*                 17.17         .032
Female                                     1.08         .000
Non-white*                                 8.29         .014
Married                                    6.24         .010
Limiting long term illness                 7.24         .012
Persons employed full time                 8.55         .015
Persons unemployed                         2.27         .003
Other employment status                   11.19         .020
Head of h’hold born UK                     4.48         .007
Head of h’hold born New Commonwealth       3.59         .005
Migrant head of household                  9.04         .016
< 1.5 persons per room: density           27.96         .053
Persons in 0 car households               32.98         .063

* Selected for adjustment variables.
Source: Reigate and Banstead; Tandridge LAD 1991 census data.


correlations are generally small but the number of obser- 
vations in each ED implies that the aggregation effects can 
be high (see the comment following equation (2.18)). 
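
The link between the two columns of Table 1 can be checked with a few lines of arithmetic. This is a small sketch under one stated assumption: $\bar n^*$ in the relation $\theta_a = 1 + (\bar n^* - 1)\delta_{aa}$ is approximated here by the average ED size of 508.6 quoted above. The function name `intra_group_correlation` is introduced only for this illustration.

```python
def intra_group_correlation(theta, n_bar):
    """Invert theta = 1 + (n_bar - 1) * delta for delta."""
    return (theta - 1.0) / (n_bar - 1.0)

n_bar = 508.6  # average number of people per ED (assumed to stand in for n-bar-star)

# Aggregation effects taken from Table 1.
effects = {
    "Persons aged 18-29": 9.20,
    "Persons aged 60 and over": 17.17,
    "Female": 1.08,
    "Persons in 0 car households": 32.98,
}

deltas = {name: round(intra_group_correlation(theta, n_bar), 3)
          for name, theta in effects.items()}
```

Even an intra-group correlation of about 0.06 produces an aggregation effect above 30 when groups contain roughly 500 people, which is the point made in the text.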

Figure 1a shows a plot of the group level correlation,
$\bar r_{ab}$, against the individual level correlation, $r_{ab}$, for every
pair of variables. Note the strong aggregation effects
which are revealed through the characteristic S-shaped
plot. Small correlations at the unit level are generally
magnified so that for most cases $|\bar r_{ab}|$ is much larger
than $|r_{ab}|$.

Figure 1a.
Figure 1c.


Since in this case we have $\bar S_{yy}$ and $S_{yys_1}$ we may carry
out a canonical grouping variable analysis so as to understand
the more important features of the grouping structure.
Table 2 shows the loadings on the 16 variables for
the first five canonical grouping variables which together
account for 89% of the multivariate aggregation effect.

The first CGV has high loadings on high density occupation
and car (i.e., auto) access and might be interpreted
as a socio-economic factor. The second CGV has high
loadings on the variables indicating people in the two oldest
age groups. It is noticeable, also, that the proportion of
Table 2
First Five CGVs for Variables in Table 1

                                       CGV1   CGV2   CGV3   CGV4   CGV5

Persons aged 18-29                      0.4    0.3    0.9    Lay    0.1
Persons aged 30-44                      0.1    0.5    0.36   1.0    0.2
Persons aged 45-59*                    -0.1    12s    02,    1.0    0.1
Persons aged 60 and over*               0.3    2.2   -0.5    2.6    0.9
Female                                  0.1    0.0    0.0    0.3    0.1
Non-white*                              0.5   -0.4    1.4   -1.1    S72
Married                                 0.2    0.5    0.4   -0.8   -0.1
Limiting long term illness              0.3    0.1   -0.2    0.2    0.3
Persons employed full time              0.7   -0.3    0.2    572    0.4
Persons unemployed                      0.7    0.0   -0.1    0.0   -0.4
Other employment status                 0.1    0.1    0.0   -0.22  -0.1
Head of h’hold born UK                  0.5   -0.1   -1.0    0.4    0.2
Head of h’hold born New Commonwealth    0.0   -0.1   -0.3    0.1    0.6
Migrant head of household               0.2    0.1    1.4    0.6   -1.3
< 0.5 persons per room                 -1.4    0.3    1.2   -0.7   -0.2
Persons in 0 car households             2.2    0.6    0.8   -1.9   -0.7

* Selected for adjustment variables.
Source: Reigate and Banstead; Tandridge LAD 1991 census data.


non-white heads of household contributes to the later
CGVs. As might be expected, variables such as proportion
Female, that exhibit almost no intra-group correlation
and hence no aggregation effect, make virtually no
contribution to the CGVs. Such variables do not vary across
areas and hence generally have no explanatory power.

In usual practice a CGV analysis will not be possible
since if $S_{yys_1}$ was available there would usually be no need
to carry out an aggregate analysis. However the CGV
analysis suggests variables that may be important since
they load highly on the first few CGVs.

It is well known in the UK context that housing tenure 
variables (which are not contained in the 16 variables of 
interest) have a powerful association with a wide variety 
of socio-economic, attitudinal and health variables. There 
are strong reasons for assuming that using these as aux- 
iliary, z, variables for adjustment would account for a 
substantial proportion of the first socio-economic dimen- 
sion and may act in place of the density of occupancy and 
car access variables that are seen to be important for the 
first CGV. The other reason for considering those variables 
is that if the present analysis is to act as an illustration of 
what might be achieved in other situations then basic 
tenure and housing variables are more likely to be available 
as adjustment variables than density of occupation and car 
access. In the light of the CGV analysis and in the spirit 
of identifying a small number of adjustment variables 
which could be expected to be available in many situations, 
we identify a set of seven potential adjustment variables. 
These are the three variables of interest marked by an
asterisk in Table 1 (Age 45-59, Age 60+,
non-white) and the four housing variables listed in Table 3
together with their aggregation effects and intra-cluster
correlations.


Table 3

Aggregation Effects and Intra-class Correlations for
Household Level Variables in Reigate LAD

                               Aggregation   Intra-class
Variable                       Effect        Correlation

Tenure: LA Rented                133.43         0.261
        Owner Occupier            90.83         0.177
Stock:  Det/semi/terrace          90.03         0.175
        Good Amenities            59.52         0.113

Source: Reigate and Banstead; Tandridge LAD 1991 census data.


In what follows the group level covariance matrix for
the original 16 variables will be adjusted by the unit level
covariance matrix for 7 z-variables (three of the basic
demographic variables in the original set and four household
variables).

Two overall measures of the effectiveness of the adjustment
were calculated. The first is


$$1 - \frac{\mathrm{tr}(S_{yys_1}^{-1} \hat\Sigma_{yy}(z)) - p}{\mathrm{tr}(S_{yys_1}^{-1} \bar S_{yy}) - p}$$


which is the reduction in the multivariate aggregation 
effect and the second is 


$$\frac{\| S_{yys_1} - \bar S_{yy} \| - \| S_{yys_1} - \hat\Sigma_{yy}(z) \|}{\| S_{yys_1} - \bar S_{yy} \|}$$


which shows the reduction in the generalised distance 
between the unit level and group level covariance matrices 
before and after adjustment. 


Table 4

                                                   % reduction in
Z-variable                        No. of      Multivariate   Generalised
Combination                       Variables   Aggregation    Distance
                                              Effect

60+                                  1           16             24
45-59, 60+                           2           38             53
Tenure                               2           30             21
Stock                                2           31             19
45-59, 60+, NW                       3           44             54
45-59, 60+, tenure                   4           57             71
45-59, 60+, stock                    4           Si]            69
45-59, 60+, tenure, NW               5           63             72
45-59, 60+, stock, NW                5           62             70
45-59, 60+, stock, tenure, NW        7           68             75
Table 4 shows the effect of using various combinations
of variables for adjustment of the aggregated analysis. The
two age variables are clearly important (accounting for
38% of the multivariate aggregation effect and 53% of the
generalized distance) but the Tenure or Housing Stock
variables are also important. When Tenure or Housing
Stock are used in conjunction with age the percentage
reduction in either measure is close to the sum of the effects
of the variables separately, showing that age and Tenure
or Housing Stock are acting as distinct adjustment variables.
Obviously the greatest success is achieved by
including all 7 adjustment variables, which account for 68%
and 75% respectively of the two aggregation measures.
These results show that around 70% of the aggregation
effects have been removed by the adjustment. Figures 2a
and 2b show the effect of adjustment by these variables.
In Figure 2a the vertical axis contains $|\bar s_{ab} - s_{abs_1}|$, the
absolute bias for the group level covariance for each pair of
variables. The horizontal axis contains $|\hat\Sigma_{ab}(z) - s_{abs_1}|$,
the absolute bias of the adjusted estimator. The hollow
symbol is used for variances of the y variables, and the
solid symbol is used for covariances. Almost all of the
plotted values show that the biases after adjustment are
smaller (often much smaller) than the original bias. In
almost all cases the adjustment has produced a substantial
improvement. Figure 2b shows the corresponding plot for
correlations rather than covariances. (Correlations of
$y_a$ with $y_a$ have obviously been omitted from this plot.) Again
there is a strong improvement with the residual bias after
adjustment being much smaller than the original bias for
the group level analysis. The results are not as successful
as for the covariances, since in some cases small biases for
the group level analysis have been made worse. In this case
the adjustments are applied to the covariance and the two
variances used in each correlation coefficient. There is
more potential for the relative changes in each component
to lead to a correlation which is worse than the original.
However, almost all of the large biases at the group level
have been improved.

Figure 1b shows the plot of the adjusted group level
correlations, $\bar r_{ab}(z)$, obtained from $\hat\Sigma_{yy}(z)$ against the unit
level correlations and can be compared with the original
unadjusted plot in Figure 1a. The characteristic S-shaped
curve shown in Figure 1a has been replaced by a plot of
points which lie about the line $\bar r_{ab}(z) = r_{ab}$, as we would
want if aggregation bias is removed.

Figures 1b, 2a and 2b show that a substantial reduction 
to the aggregation effect can be achieved by using 4 housing 
variables and 3 of the original y variables. This implies 
adjusting the original 120 variances and covariances in the 
16 x 16 matrix by 21 variances and covariances for the 
z variables. As an illustration of what might be achieved 
with minimal information we reduce the adjustment 
variables to the four involving age and Tenure. From 
Table 4 we see that these account for 57% and 71% of the 
two measures of aggregation. Figures 3a and 3b show the 
corresponding plots to Figures 2a and 2b for this case. 
Figure lc shows the plot of the adjusted correlations 
using 4 variables against the individual level correlations. 
Obviously the adjustment is not as successful but it is 
encouraging to see what can be achieved with so few 
adjustment variables. As a further measure of the effect
of the adjustment the median absolute difference between
$\bar r_{ab}$ and $r_{ab}$ was 0.186. After adjusting by 4 variables this
was reduced to 0.126 and after adjusting by 7 variables to
0.090. The corresponding median values for $|\bar s_{ab} - s_{ab}|$
were 0.173, 0.039 and 0.017 respectively.
5. CONCLUSIONS AND DISCUSSION


A model for grouped populations has been proposed 
which leads to a decomposition of the bias observed in 
group level analysis based on covariance matrices into two 
components. The first component is due to the grouping 
variables and the second is due to the residual intra-group 
correlations between the y variables given the grouping 
variables z. This decomposition provides an understanding 
of the magnitude of aggregation effects. It also provides 
a way of removing the bias due to the grouping variables 
if additional information about the unit level covariance 
matrix of the grouping variables is available. 


In many countries there are many group level data 
available at different levels of aggregation from the census 
and many other sources. The development of Geographic 
Information Systems will increase the availability of such 
data. It is important to analyse and decompose the group 
effects and the theory developed and the strategy proposed 
here provide a framework for achieving this. A proper 
understanding of which variables explain most of the 
group effects, and therefore should be used in adjusting 
ecological analyses, will open the way to making use of 
aggregated data. 
Linearization Methods for Single Phase and Two-Phase Samples:
A Cookbook Approach 


DAVID A. BINDER¹


ABSTRACT 


There are a number of asymptotically equivalent procedures for deriving the Taylor series approximation of variances 
for complex statistics. In Binder and Patak (1994) the theoretical justification for one class of methods was derived. 
However, many of these methods can be derived for practical examples using straightforward techniques that are 
not clearly described in Binder and Patak. In this paper we give a ‘‘cookbook’’ approach that can be used for many 
examples, and that has been shown to have good finite sample properties. Normally the method of choice becomes 
clear through arguments such as model-assisted methods or linearizing the jackknife; however, using our approach 
yields the desired results more directly. As well, we present new results on the application of these techniques to
two-phase samples.


KEY WORDS: Complex surveys; Variance estimation; Ratio estimator; Regression estimator; Wilcoxon rank sum
test; Estimating equations.


1. THE METHOD 


The derivation of the asymptotic variance for a wide 
class of estimators from complex survey samples is now 
well established in the literature, at least to a first order 
approximation. However, there are a number of competing 
estimators of the variance, all of which are asymptotically 
equivalent. In this paper, we discuss a simple derivation 
of one of the most favoured of these estimators in a gen- 
eral setting. This simple derivation is useful for practi- 
tioners, who may be baffled by the choices available, and 
need a quick solution to the problem. 

We start with a simple example of the approach using
the ratio estimator of a population total. Here the estimator
is

$$\hat Y_R = \hat R X, \tag{1}$$

for

$$\hat R = \hat Y/\hat X, \quad \text{and} \quad \hat Y = \sum_{k \in s} w_k y_k,$$

where s is the set of indices corresponding to sampled
units and $w_k$ is the sampling weight, normalized so that
$\sum w_k$ is an estimator of the population total; e.g., $w_k =
1/\pi_k$, where $\pi_k$ is the first order inclusion probability.
The definition of $\hat X$ is analogous to that of $\hat Y$. Applying
total differentials to both sides of (1), we obtain

$$(d\hat Y_R) = (d\hat R) X, \tag{2a}$$

where

$$(d\hat R) = \frac{(d\hat Y)}{\hat X} - \frac{\hat Y}{\hat X^2}(d\hat X) = \frac{1}{\hat X}\left((d\hat Y) - \hat R (d\hat X)\right). \tag{2b}$$


We note that, in general, the total differential for
$\hat T = g(\hat Y_1, \ldots, \hat Y_p)$ is given by

$$(d\hat T) = \left[\frac{\partial g(\hat Y)}{\partial \hat Y}\right]' (d\hat Y).$$

Although we could have avoided using $\hat R$ in (1) by
simply defining

$$\hat Y_R = (\hat Y/\hat X) X,$$

thus removing the need for explicitly defining $(d\hat R)$ in (2b),
we did so to make the more complex examples, to be given
in Section 1.2, clearer. We also note that (2a) does not
include the total differential of X, the population total of
the x-variable, since X is assumed to be fixed and known.

The next step is to replace all total differentials of
estimated quantities by deviations from their respective
expected values. On the right hand side, we substitute for
$(d\hat Y)$ the expression $(\sum w_k y_k - Y)$, and so on. For the
quantity of interest, $\hat Y_R$, we replace $d\hat Y_R$ by $\hat Y_R - Y$.
From (2), performing this step yields

$$\hat Y_R - Y = \frac{X}{\hat X}\left[\left(\sum w_k y_k - Y\right) - \hat R\left(\sum w_k x_k - X\right)\right]. \tag{3}$$


¹ David A. Binder, Director, Business Survey Methods Division, Statistics Canada, R.H. Coats Building, 11 "A", Ottawa, Ontario, Canada, K1A 0T6.
We see that this expression contains a number of
weighted estimators: those that explicitly show their
dependence on the $w_k$'s ($\sum w_k y_k$ and $\sum w_k x_k$) and those
where the $w_k$'s are implicit in the expression ($\hat X$ and $\hat R$).

For the last step, we isolate $z_k$, defined by rewriting
(3) as

$$\hat Y_R - Y = \sum w_k z_k + \text{other terms not depending explicitly on } w_k.$$

Here, we obtain

$$z_k = \frac{X}{\hat X}(y_k - \hat R x_k). \tag{4}$$

The justification for ignoring the terms not depending
explicitly on $w_k$ will be given in Section 4. Note that
$\sum w_k z_k$ has the form of the estimate of the population
total of the variable z.

Now to obtain the variance of $\hat Y_R$, we insert the new
variable $z_k$ into the k-th sample record, and use a standard
procedure for estimating the variance of a total, applied
to this variable. It is assumed that a variance estimator
with good properties is available for the sample design
under consideration.
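
As an illustration of the four steps for the ratio estimator, the following sketch computes $z_k$ from (4) and feeds it into a variance estimator for a total. It assumes simple random sampling without replacement purely for concreteness; the function names, the sample values and the population figures `N` and `X_total` are hypothetical and are not taken from the paper.

```python
import numpy as np

def ratio_linearized_variable(y, x, w, X_total):
    """Return the ratio estimate R_hat * X and z_k = (X / X_hat) (y_k - R_hat x_k)."""
    Y_hat = np.sum(w * y)
    X_hat = np.sum(w * x)
    R_hat = Y_hat / X_hat
    z = (X_total / X_hat) * (y - R_hat * x)
    return R_hat * X_total, z

def srs_variance_of_total(z, N):
    """Variance estimator of the estimated total of z under SRS without replacement."""
    n = len(z)
    return N**2 * (1.0 - n / N) * np.var(z, ddof=1) / n

# Hypothetical sample of n = 5 units from a population of N = 20.
y = np.array([3.0, 5.0, 4.0, 6.0, 8.0])
x = np.array([2.0, 4.0, 3.0, 5.0, 6.0])
N = 20
w = np.full(y.size, N / y.size)           # w_k = 1 / pi_k under SRS
X_total = 85.0                            # assumed known population total of x

Y_R, z = ratio_linearized_variable(y, x, w, X_total)
var_Y_R = srs_variance_of_total(z, N)

# Variant that drops the factor X / X_hat, for comparison.
z_without_factor = y - (np.sum(w * y) / np.sum(w * x)) * x
var_without_factor = srs_variance_of_total(z_without_factor, N)
```

The two variance estimates differ exactly by the factor $(X/\hat X)^2$, so they diverge most when the sample is unbalanced on x.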

A summary of the method in general is the following: 


1. We let the estimator of T be $\hat T$ and take its total
differential. We assume that $\hat T$ is asymptotically design
consistent.


2. We replace the total differential of $\hat T$, $d\hat T$, by $\hat T - T$. We
replace all other total differentials of estimated quantities
by the deviation from their respective expected
values, where we substitute for $(d\hat Y)$ the expression
$(\sum w_k y_k - Y)$, and so on.


3. The last step is to isolate $z_k$, when we rewrite the result
of Step 2 as

$$\hat T - T = \sum w_k z_k + \text{other terms not depending explicitly on } w_k.$$


4. Finally, to obtain the estimated variance of $\hat T$, we insert
the new variable $z_k$ into each sampled record, and use
the standard procedure (known to have good properties)
for estimating the variance of a total, applied to this
variable.


1.1 Simplest General Case 


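For one-phase samples, a simple general case is where
the estimator can be expressed as a differentiable function
of the estimated totals for certain survey variables, some
of which may be derived variables at the final sampling
unit level. In this case our approach gives:

$$\hat T = g(\hat Y_1, \ldots, \hat Y_p),$$

$$(d\hat T) = \left[\frac{\partial g(\hat Y)}{\partial \hat Y}\right]' (d\hat Y),$$

$$\hat T - T = \left[\frac{\partial g(\hat Y)}{\partial \hat Y}\right]' \left(\sum w_k y_k - Y\right), \tag{5}$$

$$z_k = \left[\frac{\partial g(\hat Y)}{\partial \hat Y}\right]' y_k. \tag{6}$$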

In what way is this formulation different from standard
Taylor methods? The main difference is how expression
(5) is treated. In standard methods, the partial derivatives
are evaluated at their expected values before $z_k$ is derived.
Then, for those components of $z_k$ that are unknown, an
estimator is substituted. For the ratio estimator, (1), this
would result in $X/\hat X$ disappearing from $z_k$ in (4), since
when $\hat X$ is replaced by its expected value, $X/\hat X$ becomes
unity. The $\hat R$ remains in the expression, as it is used to
estimate R, which is needed in the usual derivation of $z_k$.

Kott (1990) argues that the variance estimator for the
ratio which we have derived has good conditional properties
compared to the estimator which leaves out the
factor $X/\hat X$. A number of others have come to similar
conclusions. Rao (1995) showed that the method agrees
with that obtained from the linearized jackknife. Our
conjecture is that since the partial derivatives in expression
(5) are evaluated at $\hat Y$ rather than Y, the linearization is
"closer" to the original statistic, $\hat T$, so that the resulting
variances have better properties. This is, of course, not a
technical statement, but rather an intuitive justification of
the method.

We note that in expression (6) for $z_k$, all the terms are
directly observed from the sample, so that no substitution
of estimators for unknown quantities is needed.
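
The general recipe can be sketched for any smooth function g of estimated totals. The code below is an illustration, not part of the paper: it approximates the partial derivatives by central differences evaluated at the estimated totals, which is an implementation choice made here, and the names `linearized_variable`, `g_ratio` and the sample data are hypothetical. Applied to the ratio estimator it reproduces $z_k$ in (4).

```python
import numpy as np

def linearized_variable(g, Y_vars, w, rel_step=1e-6):
    """z_k = [dg/dY evaluated at the estimated totals]' y_k, gradient by central differences.

    Y_vars is an n x p array whose k-th row holds the p survey variables of unit k.
    """
    Y_hat = w @ Y_vars                    # vector of estimated totals
    grad = np.empty_like(Y_hat)
    for j in range(Y_hat.size):
        h = rel_step * max(1.0, abs(Y_hat[j]))
        up, down = Y_hat.copy(), Y_hat.copy()
        up[j] += h
        down[j] -= h
        grad[j] = (g(up) - g(down)) / (2.0 * h)
    return Y_vars @ grad

# Hypothetical data: the ratio estimator written as g(Y_hat, X_hat) = X * Y_hat / X_hat.
y = np.array([3.0, 5.0, 4.0, 6.0, 8.0])
x = np.array([2.0, 4.0, 3.0, 5.0, 6.0])
w = np.full(5, 4.0)
X_total = 85.0

def g_ratio(totals):
    return X_total * totals[0] / totals[1]

z_numeric = linearized_variable(g_ratio, np.column_stack([y, x]), w)

# Closed form from (4), for comparison.
X_hat = np.sum(w * x)
R_hat = np.sum(w * y) / X_hat
z_closed_form = (X_total / X_hat) * (y - R_hat * x)
```

Because the derivatives are taken at the estimated totals, every quantity entering `z_numeric` comes from the sample itself.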


1.2 The Case with Extra Parameters 


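For many examples, the estimator is most easily defined
in terms that include the use of parameters that are only
used to simplify the definition of the parameter of interest.
For the ratio estimator, R is an example of such an extra
parameter. In this case, an explicit equation for the estimator
of the extra parameter is available. The general method
in the presence of extra parameters may be written as:

$$\hat T = g_1(\hat Y_1, \hat\lambda), \quad \text{where } \hat\lambda = g_2(\hat Y_2),$$

$$(d\hat T) = \left[\frac{\partial g_1(\hat Y_1, \hat\lambda)}{\partial \hat Y_1}\right]' (d\hat Y_1) + \left[\frac{\partial g_1(\hat Y_1, \hat\lambda)}{\partial \hat\lambda}\right]' (d\hat\lambda),$$

where

$$(d\hat\lambda) = \left[\frac{\partial g_2(\hat Y_2)}{\partial \hat Y_2}\right]' (d\hat Y_2),$$

so that

$$z_k = \left[\frac{\partial g_1(\hat Y_1, \hat\lambda)}{\partial \hat Y_1}\right]' y_{1k} + \left[\frac{\partial g_1(\hat Y_1, \hat\lambda)}{\partial \hat\lambda}\right]' \left[\frac{\partial g_2(\hat Y_2)}{\partial \hat Y_2}\right]' y_{2k}. \tag{7}$$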

For the case where the extra parameters are defined 
only implicitly through estimating equations, we have the 
following generalization: 


T = eC Ys! ADS 
where 


Ae OK (8) 


; dg(¥,X - dg(Y,N) | tees 
(at) =\ lee (d¥;) + (eae (dX), 


I 


where by taking the total differential of (8) and isolating 
(dX), we have 


GE] 2G) Ge) 


= yy Meza + nes) 


pete rozt) ty aioe) (00) aif el) 4 (10) 
paeel| ope lereandtl an | ak rie 
We see, of course, that (10) is a generalization of the
previous forms for $z_k$ given in (6) and (7).


2. OTHER EXAMPLES 


Expressions (6), (7) and (10) above are displayed only 
for the purpose of giving the specific formulae for the 
various cases. However, in practice, we recommend using 
the basic steps from first principles. To demonstrate this, 
we give two examples: one is the familiar Generalized 
Regression Estimator (GREG); the other gives some new 
results for the Wilcoxon Rank Sum Test statistic for data 
from complex surveys. 


2.1 Generalized Regression Estimator 


The usual Generalized Regression Estimator, given, for
example, in Särndal, Swensson and Wretman (1989), may
be written as

$$\hat Y_{GREG} = \hat Y + \hat B'(X - \hat X), \quad (11)$$

where the extra parameter $\hat B$ is defined as the solution to

$$\sum_k w_k x_k (y_k - x_k'\hat B)/c_k = 0,$$

where $c_k$ is the factor to allow for heteroscedastic variance
in the regression model. This is equivalent to

$$\hat S_{xx}\hat B - \hat S_{xy} = 0, \quad (12)$$

with obvious definitions for $\hat S_{xx}$ and $\hat S_{xy}$. Taking total
differentials in (12) we get

$$(d\hat S_{xx})\hat B + \hat S_{xx}(d\hat B) - (d\hat S_{xy}) = 0,$$

so that

$$(d\hat B) = \hat S_{xx}^{-1}[(d\hat S_{xy}) - (d\hat S_{xx})\hat B].$$

Therefore, we have

$$\hat B - B = \sum_k w_k \hat S_{xx}^{-1} x_k (y_k - x_k'\hat B)/c_k + \cdots.$$

Now, taking total differentials of (11), we have

$$(d\hat Y_{GREG}) = (d\hat Y) - \hat B'(d\hat X) + (d\hat B)'(X - \hat X)
= (d\hat Y) - \hat B'(d\hat X) + [(d\hat S_{xy})' - \hat B'(d\hat S_{xx})]\hat S_{xx}^{-1}(X - \hat X).$$

After some algebraic manipulation, we obtain

$$\hat Y_{GREG} - Y = \sum_k w_k e_k[1 + x_k'\hat S_{xx}^{-1}(X - \hat X)/c_k] + \cdots,$$

where $e_k = y_k - x_k'\hat B$. We, therefore, define

$$z_k = e_k[1 + x_k'\hat S_{xx}^{-1}(X - \hat X)/c_k].$$


The variance of the estimated total of this
z-variable is identical to the variance estimator proposed in Särndal,
Swensson and Wretman (1989). There, it is argued on the
basis of the validity of the regression model, that this
variance is preferred to other Taylor expansion estimators
for the variance. We see that the derivation of this
z-variable is natural in our approach.
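
The GREG z-variable can be checked numerically. The following is a minimal sketch, assuming a single auxiliary variable so that the matrix of weighted cross-products reduces to a scalar; the function name `greg_z` and the toy data are illustrative only and do not come from the paper.

```python
def greg_z(y, x, w, c, X_total):
    """GREG estimate, slope and z-variables for one auxiliary variable.

    y, x, w, c are equal-length lists (values, auxiliary values, weights,
    heteroscedasticity factors); X_total is the known population total of x.
    """
    Y_hat = sum(wk * yk for wk, yk in zip(w, y))
    X_hat = sum(wk * xk for wk, xk in zip(w, x))
    S_xx = sum(wk * xk * xk / ck for wk, xk, ck in zip(w, x, c))
    S_xy = sum(wk * xk * yk / ck for wk, xk, yk, ck in zip(w, x, y, c))
    B = S_xy / S_xx                      # solves S_xx * B - S_xy = 0
    Y_greg = Y_hat + B * (X_total - X_hat)
    # z_k = e_k * [1 + x_k * S_xx^{-1} * (X - X_hat) / c_k], with e_k = y_k - x_k * B
    z = [(yk - xk * B) * (1.0 + xk * (X_total - X_hat) / (S_xx * ck))
         for yk, xk, ck in zip(y, x, c)]
    return Y_greg, B, z


# Toy data (hypothetical)
y = [3.0, 5.0, 4.0, 8.0]
x = [1.0, 2.0, 2.0, 4.0]
w = [10.0, 10.0, 20.0, 20.0]
c = [1.0, 1.0, 1.0, 1.0]
Y_greg, B, z = greg_z(y, x, w, c, X_total=100.0)
weighted_z_total = sum(wk * zk for wk, zk in zip(w, z))
```

Because the weighted residuals satisfy the estimating equation, the weighted total of the z-variables equals the GREG estimate minus the constant $\hat B X$, so the variance of the estimated total of $z_k$ can be passed to any routine that estimates the variance of a total.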


2.2 Wilcoxon Rank Sum Statistic 


We now show how our method works in a
more difficult, non-standard case. We assume that our
sampled units belong to one of two subpopulations which
we name Population 1 and Population 2. We define

$$I\{x \le y\} = \begin{cases} 1 & \text{if } x \le y, \\ 0 & \text{otherwise,} \end{cases}
\qquad
\delta_k = \begin{cases} 1 & \text{if } k \in \text{Pop. 1}, \\ 0 & \text{otherwise.} \end{cases}$$


We let 


$$\hat N_1(t) = \sum_{k \in s} w_k \delta_k I\{x_k \le t\},$$


which corresponds to the estimated number of Population 1
units that have values less than or equal to $t$. We define
$\hat N_2(t)$ analogously. We denote $\hat N_i = \hat N_i(\infty)$, the estimated
number of units in Population $i$. Now a weighted version
of the Wilcoxon Rank Sum Test statistic is

$$\hat T_W = \int_0^\infty [\hat N_1(t) + \hat N_2(t)]\, d\hat N_1(t). \quad (13)$$


This corresponds to the weighted sum of the ranks from
Population 1 among the weighted ranks of the combined
sample. To derive the asymptotic expected value of $\hat T_W$ in
(13), we let $N_i(t) = E[\hat N_i(t)]$ for $i = 1, 2$, and substitute
$N_i(t)$ for $\hat N_i(t)$ in (13). We then define $F_i(t) = N_i(t)/N_i$,
where $N_i = E(\hat N_i)$ and we give the null hypothesis as
$F_1(t) = F_2(t) = F(t)$, say. This results in the asymptotic
expectation being

$$N_1(N_1 + N_2)\int_0^1 F\, dF = \frac{N_1(N_1 + N_2)}{2}.$$


Note that in the case of independent samples of size $N_1$
and $N_2$ from Population 1 and Population 2, respectively,
where each population is assumed to have a continuous
distribution function and the samples are taken using
simple random sampling, the exact expected value for $\hat T_W$
in (13) is $N_1(N_1 + N_2 + 1)/2$.


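We consider the statistic

$$\int_0^\infty [\hat N_1(t) + \hat N_2(t)]\, d\hat N_1(t) - \frac{\hat N_1(\hat N_1 + \hat N_2)}{2}.$$

We use $\Delta$ rather than $d$ to denote the total differential,
since $d$ is used under the integral. Therefore, the total differential of this statistic is

$$\int_0^\infty [\Delta\hat N_1(t) + \Delta\hat N_2(t)]\, d\hat N_1(t)
+ \int_0^\infty [\hat N_1(t) + \hat N_2(t)]\, d\Delta\hat N_1(t)
- \frac{(\Delta\hat N_1)(\hat N_1 + \hat N_2) + \hat N_1(\Delta\hat N_1 + \Delta\hat N_2)}{2}.$$

Continuing with our usual approach, we replace the total differentials by weighted sums over the sample units, which gives

$$\int_0^\infty \Big[\sum_{k \in s} w_k I\{x_k \le t\}\Big] d\hat N_1(t)
+ \sum_{k \in s} w_k \delta_k [\hat N_1(x_k) + \hat N_2(x_k)]
- \frac{\sum_{k \in s} w_k \delta_k (\hat N_1 + \hat N_2) + \hat N_1 \sum_{k \in s} w_k}{2} + \cdots,$$

so that

$$z_k = \sum_{j \in s} w_j \delta_j I\{x_k \le x_j\} + \delta_k[\hat N_1(x_k) + \hat N_2(x_k)]
- \frac{\delta_k(\hat N_1 + \hat N_2) + \hat N_1}{2}. \quad (14)$$
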

We are not aware of this result having been documented
previously. It can be shown that when the null hypothesis is
true and we select independently from two populations
using simple random sampling, where the populations
have continuous distribution functions, the variance we
obtain from the z-variables in (14) is asymptotically
equivalent to the usual classical formula.
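
As an illustration, the z-variables in (14) can be computed directly from the sample. This is a minimal sketch, assuming every sampled unit belongs to exactly one of the two populations; the function name `wilcoxon_z` and the data are hypothetical.

```python
def wilcoxon_z(x, w, delta):
    """Centred weighted rank-sum statistic and its z-variables, as in (14).

    x: observed values; w: survey weights; delta: 1 if the unit is in
    Population 1 and 0 if it is in Population 2.
    """
    n = len(x)
    N1 = sum(wk * dk for wk, dk in zip(w, delta))
    N2 = sum(wk * (1 - dk) for wk, dk in zip(w, delta))

    def n_le(t):
        # N1_hat(t) + N2_hat(t): weighted count of all units with x <= t
        return sum(wk for wk, xk in zip(w, x) if xk <= t)

    # Weighted rank sum of Population 1 in the combined sample, as in (13)
    T_W = sum(w[j] * delta[j] * n_le(x[j]) for j in range(n))
    z = []
    for k in range(n):
        first = sum(w[j] * delta[j] for j in range(n) if x[k] <= x[j])
        second = delta[k] * n_le(x[k])
        third = (delta[k] * (N1 + N2) + N1) / 2.0
        z.append(first + second - third)
    return T_W - N1 * (N1 + N2) / 2.0, z


# Unit weights, no ties: Population 1 holds ranks 1, 3 and 5
stat, z = wilcoxon_z([1, 3, 5, 2, 4], [1, 1, 1, 1, 1], [1, 1, 1, 0, 0])
```

With unit weights the rank sum is 9 and the centred statistic is 1.5. The weighted total of the z-variables is twice the centred statistic, as expected for a statistic that is quadratic in the weights.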


3. TWO-PHASE SAMPLES 


The method described above extends quite easily to the 
case of two-phase samples. For example, consider the two- 
phase ratio estimator of the population total, given by 


$$\hat Y_{R(2)} = \hat R \hat X^{(1)}, \quad (15)$$

where $\hat R = \hat Y/\hat X$, $\hat X^{(1)} = \sum w_{1k} x_k$ is the first phase estimate of $X$
based on first phase weights $\{w_{1k}\}$, and $\hat Y$ and $\hat X$ are the
estimates of $Y$ and $X$, respectively, based on the second phase
sample units with weights $\{w_{1k} w_{2k}\}$, where $w_{2k}$ is the
weight assigned to the selected second phase unit, conditional
on being in the first phase sample. In particular,
letting

$$a_k = \begin{cases} 1 & \text{if the } k\text{-th unit is in the second phase sample,} \\ 0 & \text{otherwise,} \end{cases}$$

we have

$$\hat Y = \sum_{k \in s} a_k w_{1k} w_{2k} y_k,$$
kes 


where s is the set of indices corresponding to units in the 
first phase sample. 
Taking total differentials of (15), we have 


$$(d\hat Y_{R(2)}) = \left(\frac{\hat X^{(1)}}{\hat X}\right)[(d\hat Y) - \hat R(d\hat X)] + \hat R(d\hat X^{(1)}).$$


We now replace the total differentials by weighted sums 
over first phase units: 


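$$\hat Y_{R(2)} - Y = \sum_{k \in s} w_{1k}\left\{a_k w_{2k}\left(\frac{\hat X^{(1)}}{\hat X}\right)(y_k - \hat R x_k) + \hat R x_k\right\} + \cdots,$$

so that

$$z_k = a_k w_{2k}\left(\frac{\hat X^{(1)}}{\hat X}\right)(y_k - \hat R x_k) + \hat R x_k. \quad (16)$$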


We see that the steps we have taken are essentially the
same as in the one phase sample case. However, it is important
to note that now $z_k$ contains the random variable, $a_k$,
that is used to indicate whether or not the sample unit is
in the second phase sample. This is needed to compute the
two phase variance estimator.

Variances obtained from the z-variable in (16) are iden- 
tical to those given in Rao and Sitter (1995), who used a 
linearization of the jackknife to obtain their results. 
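
The two-phase z-variable in (16) can be computed as follows. This is a minimal sketch with hypothetical names and data; it assumes that $y$ is recorded as 0 for first phase units that are not in the second phase sample, which is harmless because those values are multiplied by $a_k = 0$.

```python
def two_phase_ratio_z(y, x, w1, w2, a):
    """Two-phase ratio estimate of a total and its z-variables, as in (16).

    w1: first phase weights; w2: conditional second phase weights;
    a: 1 if the first phase unit is also in the second phase sample, else 0.
    """
    X1 = sum(w1k * xk for w1k, xk in zip(w1, x))            # first phase estimate of X
    Y_hat = sum(ak * w1k * w2k * yk for ak, w1k, w2k, yk in zip(a, w1, w2, y))
    X_hat = sum(ak * w1k * w2k * xk for ak, w1k, w2k, xk in zip(a, w1, w2, x))
    R = Y_hat / X_hat
    z = [ak * w2k * (X1 / X_hat) * (yk - R * xk) + R * xk
         for ak, w2k, yk, xk in zip(a, w2, y, x)]
    return R * X1, z


y = [1.0, 0.0, 4.0, 0.0]     # observed only where a == 1
x = [2.0, 4.0, 6.0, 8.0]     # observed for every first phase unit
w1 = [5.0, 5.0, 5.0, 5.0]
w2 = [2.0, 2.0, 2.0, 2.0]
a = [1, 0, 1, 0]
Y_R2, z = two_phase_ratio_z(y, x, w1, w2, a)
```

Here the first-phase-weighted total of the z-variables reproduces the estimate itself (62.5 for these data), since the second phase residuals $y_k - \hat R x_k$ have weighted total zero.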

Extensions to other estimation problems in two phase
samples are straightforward. Suppose, for example, that
$(\hat Y_1, \hat Y_2, \ldots)$ are estimates of $(Y_1, Y_2, \ldots)$ from the
second phase sample, and that $(\hat X_1^{(1)}, \hat X_2^{(1)}, \ldots)$ are
estimates of variables available only for first phase sample
units. We suppose that a set of extra parameters, $\hat\lambda$, are
defined only in terms of the units in the second phase, and
that the variable of interest is defined in terms of these
extra parameters and the $\hat X^{(1)}$'s. Formally, then, we have


Zi 


and 
T= g(X™)d). 


Taking total differentials, we have as in (9), 


2 dU >t ROU AY 
d ey a ee a dY), 
(dh) lea eae ) 


so that 


al Laxl Loe] (2 
= an mn ea AWrWaKVk VAN Te 
Be ax avy|\« 


Therefore, the general expression for Z, is 


ES gta (eck Wea (PA ene Babes 


It then becomes necessary to put the z-variable into the 
algorithm that estimates the variance of the estimator of 
a total from a two phase sample. 


4. JUSTIFICATION 


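The technique we have described can be considered as
a direct result of the formulation given in Binder and Patak
(1994). We will summarize one of the main results in that
paper. Suppose we are interested in parameter $\theta$, defined
as the solution to

$$\hat U_1(\theta, \hat\lambda_\theta) = \sum_{k \in s} w_k u_{1k}(y_k, \theta, \hat\lambda_\theta) = 0,$$

where $\hat\lambda_\theta$ is the estimate of an extra parameter, defined as
the solution to

$$\hat U_2(\theta, \hat\lambda_\theta) = \sum_{k \in s} w_k u_{2k}(y_k, \theta, \hat\lambda_\theta) = 0,$$

for a given $\theta$. Through an argument based on removing
extra parameters for problems of testing hypotheses on $\theta$,
Binder and Patak recommend basing inferences about $\theta$
on the variable

$$u^* = u_1(y, \theta, \hat\lambda_\theta) - \left[\frac{\partial \hat U_1}{\partial\lambda}\right]\left[\frac{\partial \hat U_2}{\partial\lambda}\right]^{-1} u_2(y, \theta, \hat\lambda_\theta). \quad (17)$$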
In particular, two-sided confidence intervals for $\theta$ are
to be based on


U? (6, Xo) A 
aS ail F 
f | Ww a xa al } 


where W is the estimated variance of the estimator of a 
total when the variable being estimated is u*. 

We let uw; = g(Ay,A2) — 0. The kernel of the estimating 
equations for the y-totals will be given by uw, = y — ry 
and the kernel of the estimating equations for A, is given 
by Uy7 (Aj, 2). We let 


a Ur, YowWaj os 
0= Sm] |-| Nuss IF where N=)) wy. 


ur2 


After some algebra, from (17) the variance of interest 
is the variance of the estimated total based on the variable 
u*, given by, 


dg (A,, Ao) < 
any 


[g(r Ao) ] * FAe22 (Ar Aa) ] ~! FOte22 (Ai, Aa) , 
OX» OX any 


+ constant terms. 


This is equivalent to expression (10), thus showing that 
the methods here are consistent with those in Binder and 
Patak (1994). 
Jackknife Linearization Variance Estimators Under Stratified
Multi-Stage Sampling


W. YUNG and J.N.K. RAO¹


ABSTRACT 


Variance estimation for the poststratified estimator and the generalized regression estimator of a total under stratified 
multi-stage sampling is considered. By linearizing the jackknife variance estimator, a jackknife linearization variance 
estimator is obtained which is different from the standard linearization variance estimator. This variance estimator 
is computationally simpler than the jackknife variance estimator and yet leads to values close to the jackknife. 
Properties of the jackknife linearization variance estimator, the standard linearized variance estimator, and the 
jackknife variance estimator are studied through a simulation study. All of the variance estimators performed well 
both unconditionally and conditionally given a measure of how far away the estimated totals of auxiliary variables 
are from the known population totals. A jackknife variance estimator based on incorrect reweighting performed 
poorly, indicating the importance of correct reweighting when using the jackknife method. 


KEY WORDS: Generalized regression estimator; Jackknife variance estimator; Linearized variance estimator;
Poststratified estimator.


1. INTRODUCTION 


Large-scale sample surveys often use stratified multi-stage
designs with large numbers of strata, $L$, and
relatively few primary sampling units (clusters), $n_h$ ($\ge 2$),
sampled within each stratum. Within each cluster, some
elements (ultimate units) are sampled according to some
sampling method. We do not specify the number of stages
or the sampling methods used after the first-stage sampling,
but we assume that subsampling within sampled clusters
is performed to ensure unbiased estimation of cluster
totals, $Y_{hi}$, $i = 1, \ldots, n_h$; $h = 1, \ldots, L$.

From the specification of the survey design, basic
weights $w_{hik}$ ($> 0$), attached to the $(hik)$-th element, are
obtained. Often these basic weights $w_{hik}$ are subjected to
poststratification adjustment to ensure consistency with
known totals of poststratification variables. In the case of
a single poststratifier, the weights are ratio-adjusted to the
known population counts (e.g., age-sex counts). To handle
two or more poststratifiers with known marginal population
counts, the weights $w_{hik}$ can be calibrated through
generalized regression (see section 4), as in the Canadian
Labour Force Survey (CLFS).

The CLFS uses the jackknife method for estimating the
variance of the generalized regression estimator. The jack- 
knife method is computer intensive but it is readily applicable 
to general smooth statistics, unlike the linearization method. 
Moreover, it possesses good conditional properties. For 
example, in the context of simple random sampling and 
the ratio estimator, Royall and Cumberland (1981) showed 
that the jackknife variance estimator tracks the conditional 
variance given the sample mean of the auxiliary variable x. 


The main purpose of this paper is to study variance 
estimation for the ratio-adjusted poststratified estimator 
and the generalized regression estimator under stratified 
sampling. By linearizing the jackknife variance estimator, 
a jackknife linearization variance estimator is obtained 
which is different from the standard linearization variance 
estimator. In the case of the poststratified estimator, this 
variance estimator is identical to Rao’s (1985) variance 
estimator. The proposed variance estimator is computa- 
tionally simpler than the jackknife variance estimator and 
yet leads to values close to the jackknife. 

Section 2 introduces the jackknife variance estimator 
for the basic expansion estimator of the total, Y. Section 3 
presents the jackknife and the jackknife linearization 
variance estimators for the poststratified estimator. These 
results are extended in section 4 to the generalized regres- 
sion estimator in the context of multiple poststratification 
variables. Section 5 deals with variance estimation for a 
ratio of two totals, both of which are estimated using a 
generalized regression estimator. Results of a simulation 
study on the relative performances of the usual lineariza- 
tion variance estimator, the jackknife and the jackknife 
linearization variance estimators are reported in section 6. 


2. BASIC ESTIMATOR 


Using the basic weights $w_{hik}$, an unbiased estimator of
the population total $Y$ is of the form

$$\hat Y = \sum_{(hik) \in s} w_{hik} y_{hik}, \quad (2.1)$$

where $s$ denotes the sample of elements and $y_{hik}$ is the
value of the characteristic of interest associated with the
sample element $(hik) \in s$. For simplicity, we assume
complete response in this paper.

¹ W. Yung, Statistics Canada, Household Survey Methods Division, R.H. Coats Building, Tunney's Pasture, Ottawa, Ontario, K1A 0T6; and
J.N.K. Rao, Department of Mathematics and Statistics, Carleton University, Ottawa, Ontario, K1S 5B6.

It is common practice to sample clusters without replace- 
ment. However, at the stage of variance estimation, the 
calculations are greatly simplified by treating the sample 
as if the clusters are sampled with replacement. This 
approximation generally leads to overestimation of the
variance of $\hat Y$, but the relative bias is likely to be small if
the first-stage sampling fractions are small.


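An estimator of the variance of $\hat Y$ is given by

$$v(\hat Y) = \sum_{h=1}^{L} \frac{1}{n_h(n_h - 1)} \sum_{i=1}^{n_h} (y_{hi} - \bar y_h)^2 \equiv v(y_{hi}), \quad (2.2)$$

where $y_{hi} = \sum_k (n_h w_{hik}) y_{hik}$ and $\bar y_h = (1/n_h)\sum_i y_{hi}$.
The operator notation $v(y_{hi})$ denotes that $v(\hat Y)$ depends
only on the $y_{hi}$'s.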

To introduce the jackknife method, we need the estimator
$\hat Y_{(gj)}$ for each $(gj)$ obtained from the sample
after omitting the data from the $j$-th sampled cluster in
the $g$-th stratum ($j = 1, \ldots, n_g$; $g = 1, \ldots, L$). It is
simply obtained from (2.1) by letting $w_{gjk} = 0$, changing
$w_{gik}$ ($i \ne j$) to $n_g w_{gik}/(n_g - 1)$ and retaining the original
weights $w_{hik}$ for $h \ne g$, i.e.,

$$w_{hik(gj)} = \begin{cases} 0 & \text{if } (hi) = (gj), \\ \dfrac{n_g}{n_g - 1}\, w_{gik} & \text{if } h = g \text{ and } i \ne j, \\ w_{hik} & \text{if } h \ne g. \end{cases}$$

These jackknife weights, $w_{hik(gj)}$, are calculated for each
cluster $(gj)$. The resulting estimator of $Y$ is

$$\hat Y_{(gj)} = \sum_{(hik) \in s} w_{hik(gj)} y_{hik}.$$


The jackknife variance estimator is then given by

$$v_J(\hat Y) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j=1}^{n_g} (\hat Y_{(gj)} - \hat Y)^2. \quad (2.3)$$

The variance estimator (2.3) is applicable to general
smooth statistics, say $\hat\theta = g(\hat Y)$, by simply replacing
$\hat Y_{(gj)}$ and $\hat Y$ with $\hat\theta_{(gj)} = g(\hat Y_{(gj)})$ and $\hat\theta$ respectively. In
the linear case, $\hat\theta = \hat Y$, the jackknife variance estimator
is identical to the customary variance estimator (2.2).
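
The equality of (2.2) and (2.3) in the linear case can be verified numerically. The following is a minimal sketch; the nested-list data layout (strata, then clusters, then `(weight, y)` pairs) and the function names are assumptions made for illustration.

```python
def v_standard(sample):
    """Variance estimator (2.2) for the basic expansion estimator."""
    total = 0.0
    for stratum in sample:
        n_h = len(stratum)
        # y_hi = sum_k (n_h * w_hik) * y_hik for each sampled cluster
        y_hi = [sum(n_h * w * y for w, y in cluster) for cluster in stratum]
        y_bar = sum(y_hi) / n_h
        total += sum((v - y_bar) ** 2 for v in y_hi) / (n_h * (n_h - 1))
    return total


def estimate_total(sample, deleted=None):
    """Expansion estimator; deleted=(g, j) applies the jackknife weights."""
    est = 0.0
    for h, stratum in enumerate(sample):
        n_h = len(stratum)
        for i, cluster in enumerate(stratum):
            factor = 1.0
            if deleted is not None and h == deleted[0]:
                if i == deleted[1]:
                    continue                  # weight set to zero
                factor = n_h / (n_h - 1)      # remaining clusters in stratum g
            est += factor * sum(w * y for w, y in cluster)
    return est


def v_jackknife(sample):
    """Jackknife variance estimator (2.3)."""
    Y_hat = estimate_total(sample)
    total = 0.0
    for g, stratum in enumerate(sample):
        n_g = len(stratum)
        total += (n_g - 1) / n_g * sum(
            (estimate_total(sample, (g, j)) - Y_hat) ** 2 for j in range(n_g))
    return total


# Two strata with two and three sampled clusters (hypothetical data)
sample = [
    [[(10, 1), (10, 2)], [(10, 5)]],
    [[(5, 4)], [(5, 6), (5, 1)], [(5, 2)]],
]
```

For these data both functions return 875, with stratum contributions of 400 and 475.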


3. POSTSTRATIFIED ESTIMATOR 


Suppose the population is partitioned into $C$ poststrata
with known population counts ${}_cM$, $c = 1, \ldots, C$. We
will use the prescript $c$ to denote poststrata. An estimator
of ${}_cM$ is given by

$${}_c\hat M = \sum_{(hik) \in {}_cs} w_{hik}, \quad (3.1)$$

where ${}_cs$ is the sample of elements belonging to the $c$-th
poststratum. Similarly, an estimator of the poststratum
total ${}_cY$ is

$${}_c\hat Y = \sum_{(hik) \in {}_cs} w_{hik} y_{hik}.$$

Using the estimators ${}_c\hat Y$ and ${}_c\hat M$, we obtain a poststratified
estimator of the total $Y$ as

$$\hat Y_{ps} = \sum_c \frac{{}_cM}{{}_c\hat M}\, {}_c\hat Y. \quad (3.2)$$

We can rewrite (3.2) as

$$\hat Y_{ps} = \sum_c \sum_{(hik) \in {}_cs} {}_c\tilde w_{hik}\, y_{hik},$$

where ${}_c\tilde w_{hik} = w_{hik}({}_cM/{}_c\hat M)$ is the ratio-adjusted weight
for $(hik) \in {}_cs$. If $y_{hik}$ is the indicator variable for a poststratum,
say $c$, then $\hat Y_{ps} = {}_cM$, thus ensuring consistency
with known totals, ${}_cM$.

The standard linearization variance estimator is given
by (2.2) with $y_{hi}$ changed to

$$\tilde e_{hi} = \sum_c \sum_{k \in {}_cs} (n_h w_{hik})\, {}_ce_{hik},$$

where ${}_ce_{hik} = y_{hik} - {}_c\hat Y/{}_c\hat M$ for the $k$-th element in the
$(hi)$-th cluster belonging to ${}_cs$, i.e.,

$$v_L(\hat Y_{ps}) = v(\tilde e_{hi}). \quad (3.3)$$

Rao (1985) proposed an alternative linearization variance
estimator using the ratio-adjusted weights ${}_c\tilde w_{hik}$:

$$v_R(\hat Y_{ps}) = v(e^*_{hi}), \quad (3.4)$$

where

$$e^*_{hi} = \sum_c \sum_{k \in {}_cs} (n_h\, {}_c\tilde w_{hik})\, {}_ce_{hik}.$$


Turning to the jackknife method, we need to recalculate
the poststratification weights ${}_c\tilde w_{hik}$ each time a cluster
$(gj)$ is deleted. This is done by using the jackknife weights
$w_{hik(gj)}$ in (3.1) to get ${}_c\hat M_{(gj)}$ and then using ${}_c\tilde w_{hik(gj)} =
({}_cM/{}_c\hat M_{(gj)})\, w_{hik(gj)}$ to get

$$\hat Y_{ps(gj)} = \sum_c \sum_{(hik) \in {}_cs} {}_c\tilde w_{hik(gj)}\, y_{hik}.$$

The jackknife variance estimator is then obtained as

$$v_J(\hat Y_{ps}) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j=1}^{n_g} (\hat Y_{ps(gj)} - \hat Y_{ps})^2. \quad (3.5)$$

By linearizing (3.5), we obtain a jackknife linearization
variance estimator, $v_{JL}(\hat Y_{ps})$, which is identical to Rao's
variance estimator (3.4); see also Valliant (1993). In the
important special case of $n_h = 2$ clusters per stratum, (3.4)
and (3.5) are in fact asymptotically equal to higher order
terms, as the number of strata $L$ increases (Yung 1996).
Rao (1985) justified (3.4) on heuristic grounds by noting
that for simple random sampling it reduces to a conditionally
valid variance estimator given the poststrata
sample sizes, unlike the standard linearization variance
estimator (3.3). Särndal, Swensson and Wretman (1989)
obtained a variance estimator of the form (3.4) in the
context of unistage sampling under a model-assisted
framework. Since $v_{JL}(\hat Y_{ps})$ and $v_J(\hat Y_{ps})$ are approximately
equal, the foregoing results suggest that both variance
estimators should be "robust" in the sense of possessing
good conditional properties given the estimated poststrata
counts. Valliant (1993) conducted a simulation study to
demonstrate the "robustness" of $v_J(\hat Y_{ps})$ and $v_{JL}(\hat Y_{ps})$.
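
The difference between the two linearization variance estimators for the poststratified estimator lies only in the weights applied to the residuals. The following is a minimal sketch; the data layout (strata, then clusters, then `(weight, y, poststratum)` triples) and the function names are assumptions made for illustration.

```python
def v_operator(values_by_stratum):
    """Estimator (2.2) applied to lists of cluster-level values, one list per stratum."""
    total = 0.0
    for vals in values_by_stratum:
        n_h = len(vals)
        mean = sum(vals) / n_h
        total += sum((v - mean) ** 2 for v in vals) / (n_h * (n_h - 1))
    return total


def ps_linearization(sample, M, ratio_adjusted):
    """Standard linearization (ratio_adjusted=False) or Rao's (1985)
    alternative based on ratio-adjusted weights (ratio_adjusted=True)."""
    M_hat = {c: 0.0 for c in M}
    Y_hat = {c: 0.0 for c in M}
    for stratum in sample:
        for cluster in stratum:
            for w, y, c in cluster:
                M_hat[c] += w
                Y_hat[c] += w * y
    values = []
    for stratum in sample:
        n_h = len(stratum)
        vals = []
        for cluster in stratum:
            e_hi = 0.0
            for w, y, c in cluster:
                resid = y - Y_hat[c] / M_hat[c]             # poststratum residual
                adj = M[c] / M_hat[c] if ratio_adjusted else 1.0
                e_hi += n_h * w * adj * resid
            vals.append(e_hi)
        values.append(vals)
    return v_operator(values)


# One poststratum whose known count (80) is twice its estimate (40); hypothetical data
sample = [
    [[(10, 1, 0), (10, 2, 0)], [(10, 5, 0)]],
    [[(5, 4, 0)], [(5, 6, 0)]],
]
v_L = ps_linearization(sample, {0: 80.0}, ratio_adjusted=False)
v_R = ps_linearization(sample, {0: 80.0}, ratio_adjusted=True)
```

With a single poststratum the ratio adjustment is a constant factor, so the second estimator equals the first multiplied by the squared adjustment (here a factor of 4); the two coincide whenever the estimated poststratum counts equal the known counts.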


4. GENERALIZED REGRESSION 
ESTIMATOR 


In practice, it is common to form poststrata according 
to two or more auxiliary variables. If the resulting cell level 
population counts are available, the ratio-adjusted post- 
stratified estimator can be used to increase the efficiency 
of the estimates. However, these cell counts may not be 
known in practice. For instance, marginal counts may be 
known only for age groups and race groups but not cell 
counts for the individual age-race groups. This means that 
in terms of a two-way table, the marginal counts are 
known but not the cell level counts. To handle several 
poststratifiers with known marginal population counts, we 
can use a generalized regression estimator of Y by using 
indicator auxiliary variables to denote the categories of 
the poststratifiers (Huang and Fuller 1978; Deville and
Särndal 1992).

Let $x_{hik}$ be a vector of auxiliary variables with known
population totals $X$. The generalized regression estimator
of $Y$ is then given by

$$\hat Y_r = \hat Y + (X - \hat X)^T \hat B, \quad (4.1)$$

where

$$\hat X = \sum_{(hik) \in s} w_{hik} x_{hik},$$

and $\hat B$ is the vector of estimated regression coefficients

$$\hat B = \hat A^{-1}\hat b,$$

where

$$\hat A = \sum_{(hik) \in s} w_{hik} x_{hik} x_{hik}^T,$$

and

$$\hat b = \sum_{(hik) \in s} w_{hik} x_{hik} y_{hik}.$$


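The poststratified estimator, $\hat Y_{ps}$, is a special case of
(4.1) by letting $x_{hik}$ denote the vector of indicator variables
for the poststrata. In this case, $X = ({}_1M, \ldots, {}_CM)^T$,
$\hat X = ({}_1\hat M, \ldots, {}_C\hat M)^T$ and $\hat B = ({}_1\hat R, \ldots, {}_C\hat R)^T$ with
${}_c\hat R = {}_c\hat Y/{}_c\hat M$. Thus,

$$\hat Y_r = \hat Y + \sum_c {}_c\hat R\,({}_cM - {}_c\hat M) = \hat Y_{ps}.$$
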

In the case of two or more poststratifiers, X corresponds 
to the vector of marginal population counts. 

The generalized regression estimator may be rewritten
as

$$\hat Y_r = \sum_{(hik) \in s} w^*_{hik} y_{hik},$$

where

$$w^*_{hik} = w_{hik} a_{hik} \quad (4.2)$$

is the "final" or "calibration" weight with

$$a_{hik} = 1 + x_{hik}^T \hat A^{-1}(X - \hat X).$$

In the special case of $\hat Y_{ps}$, we have $a_{hik} = {}_cM/{}_c\hat M$ for
$(hik) \in {}_cs$. Writing $\hat Y_r$ in the operator notation as $\hat Y_r(y_{hik})$,
it is readily verified that the generalized regression
estimator $\hat X_r = \hat Y_r(x_{hik}) = X$, thus ensuring consistency
with known totals $X$.

Turning to variance estimation, the standard linearization
variance estimator is again given by (2.2) with $y_{hi}$
changed to

$$\tilde e_{hi} = \sum_k (n_h w_{hik}) e_{hik},$$

where

$$e_{hik} = y_{hik} - x_{hik}^T \hat B \quad (4.3)$$

are the estimated residuals, i.e.,

$$v_L(\hat Y_r) = v(\tilde e_{hi}). \quad (4.4)$$
For the jackknife method we need to recalculate the
calibration weights $w^*_{hik}$ each time a cluster $(gj)$ is deleted.
These weights are given by

$$w^*_{hik(gj)} = w_{hik(gj)}\, a_{hik(gj)},$$

where

$$a_{hik(gj)} = 1 + x_{hik}^T \hat A_{(gj)}^{-1}(X - \hat X_{(gj)}),$$

$$\hat A_{(gj)} = \sum_{(hik) \in s} w_{hik(gj)} x_{hik} x_{hik}^T,$$

and

$$\hat X_{(gj)} = \sum_{(hik) \in s} w_{hik(gj)} x_{hik}.$$

Denote the resulting generalized regression estimator as

$$\hat Y_{r(gj)} = \sum_{(hik) \in s} w^*_{hik(gj)} y_{hik} = \hat Y_{(gj)} + (X - \hat X_{(gj)})^T \hat B_{(gj)},$$

where $\hat B_{(gj)}$ is the vector of estimated regression coefficients
when the $(gj)$-th cluster is deleted:

$$\hat B_{(gj)} = \hat A_{(gj)}^{-1} \hat b_{(gj)}$$

with

$$\hat b_{(gj)} = \sum_{(hik) \in s} w_{hik(gj)} x_{hik} y_{hik}.$$

The jackknife variance estimator of $\hat Y_r$ is then given by

$$v_J(\hat Y_r) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j=1}^{n_g} (\hat Y_{r(gj)} - \hat Y_r)^2. \quad (4.5)$$

It is shown in the Appendix that by linearizing the
jackknife variance estimator (4.5), one obtains

$$v_{JL}(\hat Y_r) = v(e^*_{hi}) \quad (4.6)$$

with

$$e^*_{hi} = \sum_k (n_h w^*_{hik}) e_{hik},$$

where $w^*_{hik}$ is defined in (4.2) and $e_{hik}$ is defined in (4.3). It
is interesting to note that the jackknife linearization variance
estimator (4.6) is similar to the model-assisted variance
estimator proposed by Särndal, Swensson and Wretman
(1989) in the context of unistage sampling. Yung (1996)
established the asymptotic equivalence of $v_J(\hat Y_r)$ and
$v_{JL}(\hat Y_r)$ to higher order terms in the important special case
of $n_h = 2$ clusters per stratum. Note that the above results
are also applicable to general auxiliary variables, $x_{hik}$.


Binder (1996) proposed a new linearization method which
also leads to $v_{JL}(\hat Y_r)$. In this method, the partial derivatives
are evaluated at the estimates $\hat Y$, $\hat X$ and $\hat B$, rather than the
population values $Y$, $X$ and $B$ as in the traditional linearization
method. Given that $v_J$ and $v_{JL}$ are design-consistent
(Yung 1996) and possess good conditional properties, our
results provide theoretical justification for Binder's method
which was proposed as a "cookbook approach".

The computation of the jackknife variance estimator
involves the inversion of the matrix $\hat A_{(gj)}$ for each $(gj)$.
However, the jackknife variance estimator can be approximated
by retaining the inverse for the full sample, $\hat A^{-1}$,
and then using modified weights

$$\tilde w_{hik(gj)} = w_{hik(gj)}\, \tilde a_{hik(gj)}$$

with

$$\tilde a_{hik(gj)} = 1 + (w_{hik}/w_{hik(gj)})\, x_{hik}^T \hat A^{-1}(X - \hat X_{(gj)}).$$

The resulting estimator of $Y$, when the $(gj)$-th cluster is
deleted, is given by

$$\tilde Y_{r(gj)} = \sum_{(hik) \in s} \tilde w_{hik(gj)} y_{hik},$$

and the corresponding jackknife variance estimator is

$$\tilde v_J(\hat Y_r) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j=1}^{n_g} (\tilde Y_{r(gj)} - \hat Y_r)^2. \quad (4.7)$$

It is readily seen that (4.7) is exactly equal to the standard
linearization variance estimator (4.4).
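
The equality of (4.7) and (4.4) can be confirmed numerically. The following is a minimal sketch, assuming a single auxiliary variable so that the full-sample matrix is a scalar; the data layout (strata, then clusters, then `(weight, y, x)` triples) and the function names are illustrative. The modified weight is written as $w_{hik(gj)} + w_{hik} x_{hik} \hat A^{-1}(X - \hat X_{(gj)})$, which is the product of the jackknife weight and the modified adjustment without dividing by the zero weight of a deleted unit.

```python
def flatten(sample, deleted=None):
    """Rows (w, w_jackknife, y, x); deleted=(g, j) drops cluster j of stratum g."""
    rows = []
    for h, stratum in enumerate(sample):
        n_h = len(stratum)
        for i, cluster in enumerate(stratum):
            for w, y, x in cluster:
                if deleted is not None and h == deleted[0]:
                    w_jk = 0.0 if i == deleted[1] else w * n_h / (n_h - 1)
                else:
                    w_jk = w
                rows.append((w, w_jk, y, x))
    return rows


def v_standard_linearization(sample):
    """Estimator (4.4): formula (2.2) applied to full-sample GREG residuals."""
    rows = flatten(sample)
    A = sum(w * x * x for w, _, y, x in rows)
    B = sum(w * x * y for w, _, y, x in rows) / A
    total = 0.0
    for stratum in sample:
        n_h = len(stratum)
        e_hi = [sum(n_h * w * (y - x * B) for w, y, x in cluster)
                for cluster in stratum]
        mean = sum(e_hi) / n_h
        total += sum((e - mean) ** 2 for e in e_hi) / (n_h * (n_h - 1))
    return total


def v_approx_jackknife(sample, X_total):
    """Estimator (4.7): jackknife that keeps the full-sample value of A."""
    full = flatten(sample)
    A = sum(w * x * x for w, _, y, x in full)
    X_hat = sum(w * x for w, _, y, x in full)
    Y_r = sum(w * (1.0 + x * (X_total - X_hat) / A) * y for w, _, y, x in full)
    total = 0.0
    for g, stratum in enumerate(sample):
        n_g = len(stratum)
        for j in range(n_g):
            rows = flatten(sample, (g, j))
            X_jk = sum(w_jk * x for _, w_jk, y, x in rows)
            Y_tilde = sum((w_jk + w * x * (X_total - X_jk) / A) * y
                          for w, w_jk, y, x in rows)
            total += (n_g - 1) / n_g * (Y_tilde - Y_r) ** 2
    return total


# Hypothetical data: two strata with two and three clusters
sample = [
    [[(10, 1, 2), (10, 2, 1)], [(10, 5, 3)]],
    [[(5, 4, 2)], [(5, 6, 4), (5, 1, 1)], [(5, 2, 2)]],
]
```

The two functions agree up to rounding for any value of `X_total`, because with the regression coefficient held fixed the replicate differences are differences of weighted residual totals.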


5. ESTIMATION OF A RATIO 
Often a ratio of two estimated totals is required. For
example, in a family expenditure survey, one may be interested
in the proportion of income spent on clothing. Let

$$\hat Y_r = \hat Y + (X - \hat X)^T \hat B_1$$

be a generalized regression estimator of the total amount
spent on clothing, $Y$. Similarly, let

$$\hat Z_r = \hat Z + (X - \hat X)^T \hat B_2$$

be a generalized regression estimator of the total income,
$Z$. The proportion of interest is $\theta = Y/Z$, and can be
estimated by

$$\hat\theta = \hat Y_r/\hat Z_r.$$

The jackknife variance estimator is given by

$$v_J(\hat\theta) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j=1}^{n_g} (\hat\theta_{(gj)} - \hat\theta)^2, \quad (5.1)$$

where

$$\hat\theta_{(gj)} = \hat Y_{r(gj)}/\hat Z_{r(gj)}.$$

Linearizing the jackknife variance estimator, (5.1), we
obtain a jackknife linearization variance estimator

$$v_{JL}(\hat\theta) = v(r^{**}_{hi}), \quad (5.2)$$

where

$$r^{**}_{hi} = \frac{1}{\hat Z_r} \sum_k (n_h w^*_{hik})\, e^*_{hik}$$

with

$$e^*_{hik} = e_{1hik} - \frac{\hat Y_r}{\hat Z_r}\, e_{2hik},$$

and

$$e_{1hik} = y_{hik} - x_{hik}^T \hat B_1, \qquad e_{2hik} = z_{hik} - x_{hik}^T \hat B_2.$$

Proof of (5.2) is omitted for simplicity.


6. SIMULATION STUDY 


We performed a simulation study to investigate the un- 
conditional and conditional finite sample properties of the 
variance estimators in the case of a single poststratifier as 
well as two poststratification variables. For this purpose, 
we used a fixed finite population, considered by Valliant 
(1993), consisting of 10,841 persons included in the 
September 1988 Current Population Survey (CPS) of the 
United States. The variable of interest, y, is the weekly wages 
for each person. The single poststratifier was defined on 
the basis of age, race and sex, while the two poststratifiers 
were based on the variables age, with five levels, and race, 
with two levels (see Tables 1 and 2 for details). 


Table 1
Assignment of Age/Race/Sex Categories to Poststrata:
Single Poststratifier

                    Nonblack            Black
Age               Male   Female      Male   Female
19 and under        1      1           1      1
20-24               2      3           3      3
25-34               5      6           4      4
35-64               7      8           4      4
65 and over         2      3           3      1

Note: Cell numbers (1-8) are poststratum identification numbers.
Table 2
Assignment of Age/Race Categories to Poststrata:
Two Poststratifiers

Age               Nonblack    Black
19 and under       (1,1)      (1,2)     PS1(1)
20-24              (2,1)      (2,2)     PS1(2)
25-34              (3,1)      (3,2)     PS1(3)
35-64              (4,1)      (4,2)     PS1(4)
65 and over        (5,1)      (5,2)     PS1(5)
                   PS2(1)     PS2(2)

Note: Numbers in margins are poststratum identification numbers.
Cells (i,j) denote poststrata (i = 1, ..., 5; j = 1, 2).


The study population contained 2,826 geographical
segments, each composed of about four neighbouring
households. One hundred design strata ($L = 100$) were
created with each stratum having about the same total
number of households. We used a stratified two-stage
sampling design with segments as clusters and persons as
the second-stage units. In each stratum $n_h = 2$ segments
were selected with probability proportional to the number
of persons in each segment, and a simple random sample
of $m_{hi} = 4$ persons was selected without replacement if
the sample segment contained more than four persons. In
sample segments with four or fewer persons, all persons
in the segment were selected. Using this design, we selected
two sets of 10,000 independent samples, one set for the
one-way poststratification case and the other set for the
two-way poststratification case.

From each sample, we computed the basic estimator,
the relevant poststratified estimator, $\hat Y_{ps}$ or $\hat Y_r$, and four
variance estimators: the standard linearization variance
estimator $v_L$, the jackknife linearization variance estimator
$v_{JL}$, the jackknife $v_J$, and an incorrect jackknife variance
estimator $v_J^{*}$. In applying the jackknife procedure, it is
questioned whether or not the "final" or "calibrated"
weights need to be recalculated each time a cluster is
deleted. The correct jackknife variance estimator does
recalculate the "final" weight whenever a cluster is deleted
while the incorrect jackknife variance estimator fails to do
this. For the one-way poststratification case, $v_J^{*}(\hat Y_{ps})$ uses
the full adjustment ${}_cM/{}_c\hat M$ instead of ${}_cM/{}_c\hat M_{(gj)}$ when
the $(gj)$-th cluster is deleted, i.e., $\hat Y_{ps(gj)}$ uses the weights
$({}_cM/{}_c\hat M)\, w_{hik(gj)}$ instead of $({}_cM/{}_c\hat M_{(gj)})\, w_{hik(gj)}$. Similarly,
for the two-way poststratification case, $v_J^{*}(\hat Y_r)$ uses the
full adjustment $a_{hik}$ instead of $a_{hik(gj)}$ when the $(gj)$-th
cluster is deleted, i.e., $\hat Y_{r(gj)}$ uses the weights $w_{hik(gj)} a_{hik}$
instead of $w_{hik(gj)} a_{hik(gj)}$. The linearized version of $v_J^{*}$ is
the same as the variance estimator $v_R$ (equation 3.4) with
${}_ce_{hik}$ replaced by $y_{hik}$ in the case of $\hat Y_{ps}$ and $v_{JL}$ (equation
4.6) with $e_{hik}$ replaced by $y_{hik}$ in the case of the generalized
regression estimator $\hat Y_r$. That is,

$$v_{JL}^{*}(\hat Y_{ps}) = v(y^*_{hi})$$

with

$$y^*_{hi} = \sum_c \sum_{k \in {}_cs} (n_h\, {}_c\tilde w_{hik})\, y_{hik},$$

and

$$v_{JL}^{*}(\hat Y_r) = v(y^*_{hi})$$

with

$$y^*_{hi} = \sum_{k} (n_h w^*_{hik})\, y_{hik}.$$

Since $v_J^{*}$ uses the $y$'s instead of the residuals $e$'s, it is clear
that $v_J^{*}$ should overestimate the true variance of the estimator,
although it is computationally simpler than $v_J$.
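
The effect of failing to recalculate the poststratification adjustment can be seen in a small example. The following is a minimal sketch with hypothetical names and data (strata, then clusters, then `(weight, y, poststratum)` triples); the `reweight` flag switches between the correct and the incorrect jackknife for the poststratified estimator.

```python
def ps_total(sample, M, deleted=None, reweight=True):
    """Poststratified estimate of a total; deleted=(g, j) gives a jackknife replicate.

    reweight=True recomputes the poststratum count estimate from the replicate
    weights (correct jackknife); reweight=False keeps the full-sample
    adjustment (incorrect jackknife).
    """
    rows = []  # (full-sample weight, jackknife weight, y, poststratum)
    for h, stratum in enumerate(sample):
        n_h = len(stratum)
        for i, cluster in enumerate(stratum):
            for w, y, c in cluster:
                if deleted is not None and h == deleted[0]:
                    w_jk = 0.0 if i == deleted[1] else w * n_h / (n_h - 1)
                else:
                    w_jk = w
                rows.append((w, w_jk, y, c))
    total = 0.0
    for c, M_c in M.items():
        M_hat_full = sum(w for w, w_jk, y, cc in rows if cc == c)
        M_hat_jk = sum(w_jk for w, w_jk, y, cc in rows if cc == c)
        Y_hat_jk = sum(w_jk * y for w, w_jk, y, cc in rows if cc == c)
        total += M_c / (M_hat_jk if reweight else M_hat_full) * Y_hat_jk
    return total


def v_jackknife_ps(sample, M, reweight=True):
    """Jackknife variance estimator for the poststratified estimator."""
    Y_ps = ps_total(sample, M)
    total = 0.0
    for g, stratum in enumerate(sample):
        n_g = len(stratum)
        for j in range(n_g):
            diff = ps_total(sample, M, (g, j), reweight) - Y_ps
            total += (n_g - 1) / n_g * diff ** 2
    return total


# y is the indicator of poststratum 0, whose known count is 40
sample = [
    [[(10, 1, 0), (10, 0, 1)], [(10, 1, 0)]],
    [[(10, 0, 1)], [(10, 1, 0), (10, 0, 1)]],
]
M = {0: 40.0, 1: 60.0}
v_correct = v_jackknife_ps(sample, M, reweight=True)
v_incorrect = v_jackknife_ps(sample, M, reweight=False)
```

When $y$ is the indicator of a poststratum, every correctly reweighted replicate reproduces the known count, so the correct jackknife returns zero; the incorrect version returns a positive value (1600/9 for these data) even though the estimator has no sampling variability.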


(i) Unconditional Results 


To compare the unconditional performances of the
variance estimators we computed the empirical relative
bias (RB) for each variance estimator: RB of a variance
estimator $v$ is

$$\text{RB} = \frac{1}{\text{MSE}}\left(\frac{1}{10{,}000}\sum_i v_i\right) - 1,$$

where $v_i$ is the value of $v$ for the $i$-th simulated sample
($i = 1, \ldots, 10{,}000$) and MSE is the empirical MSE of the
estimator, say $\hat Y$:

$$\text{MSE} = \frac{1}{10{,}000}\sum_i (\hat Y_i - Y)^2,$$

where $\hat Y_i$ is the value of $\hat Y$ in the $i$-th simulated sample.

Error rates for normal theory confidence intervals on
the total $Y$ were also calculated for each variance estimator,
using a nominal error rate of 5%:

$$\text{error rate} = 1 - \frac{1}{10{,}000}\,(\text{number of samples with } L_i \le Y \le U_i),$$

where $L_i \le Y \le U_i$ is a confidence interval on $Y$ for the
$i$-th simulated sample. Lower and upper error rates were
calculated as:

$$\text{lower error rate} = \frac{1}{10{,}000}\,(\text{number of samples with } Y < L_i),$$

$$\text{upper error rate} = \frac{1}{10{,}000}\,(\text{number of samples with } Y > U_i).$$

We also calculated the average lengths of the confidence
intervals as

$$\text{average length} = \frac{1}{10{,}000}\sum_i (U_i - L_i).$$
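
These performance measures are straightforward to compute from the simulated estimates and variance estimates. The following is a minimal sketch; the function name `performance` is hypothetical, and the interval form $\hat Y_i \pm 1.96\sqrt{v_i}$ is an assumption corresponding to a normal theory interval with nominal error rate 5%.

```python
def performance(estimates, variances, Y_true, z=1.96):
    """Relative bias, error rates and average interval length over simulated samples."""
    n = len(estimates)
    mse = sum((e - Y_true) ** 2 for e in estimates) / n
    relative_bias = (sum(variances) / n) / mse - 1.0
    lower_miss = upper_miss = 0
    total_length = 0.0
    for e, v in zip(estimates, variances):
        half = z * v ** 0.5
        L, U = e - half, e + half
        if Y_true < L:
            lower_miss += 1      # interval lies entirely above the true total
        elif Y_true > U:
            upper_miss += 1      # interval lies entirely below the true total
        total_length += U - L
    return {
        "relative_bias": relative_bias,
        "error_rate": (lower_miss + upper_miss) / n,
        "lower_error_rate": lower_miss / n,
        "upper_error_rate": upper_miss / n,
        "average_length": total_length / n,
    }


# Four simulated samples with true total 10 (hypothetical values)
result = performance([9.0, 11.0, 10.0, 14.0], [1.0, 1.0, 1.0, 1.0], Y_true=10.0)
```

In this toy example only the fourth interval misses the true total, and it does so from above, so the error rate and the lower error rate are both 0.25.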


Table 3 reports the unconditional results for the poststratified
estimator $\hat Y_{ps}$ using the above performance
measures. With respect to relative bias, $v_{JL}$ and $v_J$ both
perform well with |RB| < 1% while the incorrect jackknife
$v_J^{*}$ severely overestimates the MSE (RB = 37%). We
note that $v_L$ is also estimating the MSE of $\hat Y_{ps}$ well
unconditionally (|RB| < 1%), contrary to Valliant's (1993)
claim. Valliant (1993) reported RB of 35% for $v_L$ using
the same data set. In view of the design-consistency of $v_L$
supplemented by our simulation results on $v_L$, we conjecture
that Valliant's calculations on $v_L$ might be incorrect.


Table 3
Unconditional Results for the Poststratified Estimator $\hat Y_{ps}$

Performance measure       $v_L$     $v_{JL}$    $v_J$     $v_J^{*}$
Relative bias (%)         -0.44     0.12      0.26      37.16
Error rate (%)             5.20     5.09      5.06       2.41
Lower error rate (%)       2.41     2.35      2.38       0.99
Upper error rate (%)       2.79     2.74      2.68       1.42
Average length             3.81     3.82      3.83       4.48


Turning to confidence interval performance, Table 3
shows that the error rates associated with $v_L$, $v_{JL}$ and $v_J$
are close to the nominal 5% while the error rate for $v_J^{*}$ is
considerably lower than 5% (about 2.5%). Performances
with respect to lower and upper error rates are also similar.
The variance estimators, $v_L$, $v_{JL}$ and $v_J$, perform similarly
in terms of average length of confidence intervals while
the average length associated with $v_J^{*}$ is significantly
larger due to overestimation bias. Finally, we note that the
performance measures for $v_J$ and $v_{JL}$ are very close,
supporting the asymptotic equivalence of $v_J$ and $v_{JL}$.


Table 4
Unconditional Results for the Generalized Regression
Estimator $\hat Y_r$

Performance measure       $v_L$     $v_{JL}$        $v_J$     $v_J^{*}$
Relative bias (%)         -0.96     0.76          0.57      25.87
Error rate (%)             5.30     (illegible)   5.23       3.07
Lower error rate (%)       2.24     (illegible)   2.19       1.08
Upper error rate (%)       3.06     3.06          3.04       1.99
Average length             3.94     3.95          3.95       4.44
Unconditional results for the generalized regression
estimator $\hat Y_r$ are reported in Table 4. As in the case of
$\hat Y_{ps}$, the variance estimators $v_L$, $v_{JL}$ and $v_J$ perform well
both in terms of relative bias and error rates of confidence
intervals. On the other hand, the incorrect jackknife $v_J^{*}$
leads to severe overestimation which in turn is reflected
in the lower than nominal error rates and larger average
length of confidence intervals.


(ii) Conditional Results 


We have also studied conditional properties of the 
variance estimators, following Valliant (1993). For the 
poststratified estimator, we divided the 10,000 simulated 
samples into 10 groups each containing 1,000 samples 
using the measure (Valliant 1993) 


The measure D,, was calculated for each sample and the 
10,000 samples were sorted in ascending order according 
to the D,,-values and then divided into groups. We may 
interpret D,, as a measure of how “‘balanced’’ the sample 
is with respect to the distribution of the poststrata counts. 

For the generalized regression estimator, we used the 
following natural extension of D,;: 


o-E (Hi) -E(C-). 


a b 


where a and b index the levels of the two poststratification 
variables and (,M, ,M) and (,M, ,M) are the corre- 
sponding marginal counts. We may interpret D, as a 
measure of how ‘‘balanced’’ the sample is with respect to 
the distribution of the marginal poststrata counts. 


Table 5
Conditional Relative Biases (%) for the Poststratified
Estimator $\hat Y_{ps}$

Group    $v_L$          $v_{JL}$        $v_J$     $v_J^{*}$
1        -5.00        -8.05         -7.88     17.83
2        (illegible)  -1.18         -1.01     28.06
3         8.33         7.03          7.19     41.29
4        -1.10        -1.56         -1.42     31.82
5        -0.76        -0.69         -0.55     34.77
6         2.50         3.39          3.53     41.69
7         6.10        (illegible)    7.66     48.86
8         6.60         8.82          8.96     53.54
9        -4.46        -1.43         -1.31     41.11
10      -13.56        -9.17         -9.07     36.63
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Table 6 
Conditional Error Rates (%) for the Poststratified 
Estimator 

Group = vz(Yps) Va (Yps) Va ps) V5 Eps) 
1 By) 5.9 5.9 3.4 
2 4.6 4.8 4.8 2.9 
3 87) 3.8 3.8 1.9 
4 So 5.8 5.8 2.9 
5 4.9 4.8 Anil 2.6 
6 Sul 5.0 4.8 2.2 
7 Sez 4.8 4.8 Dal 
8 4.5 4.3 4.3 1.3 
9 5.8 5.4 5.4 2.4 
10 7.0 6.3 6.3 2.4 


The results for the poststratified estimator are given in 
Tables 5 and 6: conditional relative biases in Table 5 and 
conditional error rates (nominal 5%) in Table 6. These 
performance measures were computed in the same manner 
as in the unconditional case, but from each group separately. 
It is clear from Tables 5 and 6 that $v_L$, $v_{JL}$ and $v_J$ all 
perform well, although $v_L$ is somewhat worse in the 
extreme groups 1 and 10, while $v_J^*$ performed poorly as 
before. It is somewhat surprising to see $v_L$ performing so 
well conditionally. A possible explanation is that with our 
particular sampling design we have $\hat M = \sum_{(hik) \in s} w_{hik} = M$, 
so that 

$$\sum_c {}_c\hat M = \hat M = M.$$ 

Because of this, we do not obtain samples which are poorly 
balanced: if some poststrata counts ${}_c\hat M$ are gross 
overestimates, say, then the other counts correct for the 
overestimation in order to satisfy the above constraint. 
Thus, we see mostly well-balanced samples, in which case 
$v_L$ is expected to perform well. 


Table 7 
Conditional Relative Biases (%) for the Generalized Regression Estimator 

Group       v_L     v_JL      v_J     v_J* 
  1        9.25     4.95      [?]    26.51 
  2        3.99     1.50     1.67    24.96 
  3       -3.24    -4.76    -4.59      [?] 
  4       -2.66    -3.43    -3.26    20.53 
  5        7.90     7.61     7.80    35.46 
  6       -3.60    -3.12    -2.94    23.38 
  7       -9.24    -8.27    -8.08    17.41 
  8        3.34     5.30     5.50    35.84 
  9       -3.75    -0.85    -0.62    30.84 
 10       -8.68    -4.15    -3.92    28.50 

[?] = entry illegible in the source. 

Table 8 
Conditional Error Rates (%) for the Generalized Regression Estimator 

Group       v_L     v_JL      v_J     v_J* 
  1         4.3      4.5      4.4      3.0 
  2         4.9      5.0      5.0      3.3 
  3         [?]      [?]      [?]      3.8 
  4         [?]      5.9      5.9      [?] 
  5         3.9      4.0      4.0      [?] 
  6         [?]      [?]      [?]      3.0 
  7         5.9      5.8      5.8      2.9 
  8         5.8      [?]      [?]      2.8 
  9         [?]      [?]      4.9      3.0 
 10         6.3      5.8      5.8      [?] 

[?] = entry illegible in the source. 


The results for the generalized regression estimator are 
given in Tables 7 and 8: conditional relative biases in 
Table 7 and conditional error rates (nominal 5%) in 
Table 8. The results are very similar to those for the one 
stratifier case. In both cases we again note that the 
performance measures for $v_J$ and $v_{JL}$ are very close, supporting 
the asymptotic equivalence of $v_J$ and $v_{JL}$. 

In summary, the three variance estimators $v_L$, $v_{JL}$ and 
$v_J$ performed similarly. The incorrect jackknife $v_J^*$ 
performed poorly, indicating that reweighting must be done 
each time a cluster is deleted. 
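
The difference between the jackknife with reweighting and the incorrect jackknife can be illustrated with a small sketch. It is not the simulation code of the paper: it assumes a single stratum, a poststratified total, and toy data, and the names `poststratified_total` and `jackknife_pair` are hypothetical. The correct version recomputes the poststratification adjustment on every replicate; the incorrect version reuses the full-sample adjustment factors.

```python
import numpy as np

def poststratified_total(w, y, c, M):
    """Poststratified total: sum over poststrata of M_c * (weighted mean of y in c)."""
    M_hat = np.bincount(c, weights=w, minlength=len(M))      # estimated counts
    Y_hat = np.bincount(c, weights=w * y, minlength=len(M))  # estimated totals
    return float(np.sum(M * Y_hat / M_hat))

def jackknife_pair(w, y, c, cluster, M):
    """Delete-one-cluster jackknife variances: (with reweighting, without)."""
    ids = np.unique(cluster)
    n = len(ids)
    full = poststratified_total(w, y, c, M)
    # Full-sample factors M_c / M_hat_c, held fixed by the incorrect version.
    g_fixed = (M / np.bincount(c, weights=w, minlength=len(M)))[c]
    reps_ok, reps_bad = [], []
    for j in ids:
        keep = cluster != j
        w_j = w[keep] * n / (n - 1)  # usual jackknife weight inflation
        # Correct: redo the poststratification on the replicate.
        reps_ok.append(poststratified_total(w_j, y[keep], c[keep], M))
        # Incorrect: reuse the full-sample adjustment factors.
        reps_bad.append(float(np.sum(g_fixed[keep] * w_j * y[keep])))
    scale = (n - 1) / n
    v_ok = scale * sum((r - full) ** 2 for r in reps_ok)
    v_bad = scale * sum((r - full) ** 2 for r in reps_bad)
    return v_ok, v_bad

# Toy data (assumed): 4 clusters of 3 elements, 2 poststrata, y constant within poststratum.
cluster = np.repeat(np.arange(4), 3)
c = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
w = np.full(12, 10.0)
M = np.array([70.0, 50.0])
y = np.where(c == 0, 2.0, 5.0)
v_ok, v_bad = jackknife_pair(w, y, c, cluster, M)
```

Because $y$ is constant within poststrata in this toy example, the poststratified total has no sampling variability; the jackknife with reweighting reflects this, while the version with fixed factors reports a clearly positive variance, in line with the overestimation seen in the tables.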


7. CONCLUDING REMARKS 


Beebakhee (1995) applied the three variance estimators, 
$v_J$, $v_{JL}$ and $v_L$, to a number of household surveys 
conducted by Statistics Canada. Her empirical results showed 
that the jackknife linearization variance estimator, $v_{JL}$, 
consistently consumed less time and money for all study 
surveys than the jackknife variance estimator, $v_J$, and yet 
approximated $v_J$ very well. These results are of practical 
importance because the users wanted a computationally 
simpler variance estimator which can approximate the 
currently used $v_J$ very well. The standard linearization 
variance estimator $v_L$ performed similarly to $v_{JL}$ in terms 
of cost and time, but it did not approximate $v_J$ as well 
as $v_{JL}$. 

If the primary interest is the estimation of totals or 
ratios, then the jackknife linearization variance estimator, 
$v_{JL}$, is attractive because it is computationally simpler 
than the jackknife variance estimator, $v_J$, and yet leads 
to values close to the jackknife. But for general smooth 
statistics $v_{JL}$ suffers from the same disadvantage as the 
standard linearization variance estimator, $v_L$, in the sense 
that both require the derivation of a separate formula for 
each statistic, unlike $v_J$. In terms of statistical properties, 
our simulation study suggests that the three variance 
estimators, $v_L$, $v_{JL}$ and $v_J$, perform similarly. On the 
other hand, the incorrect jackknife $v_J^*$, which uses the 
same adjustment whenever a cluster is deleted, performs 
poorly, indicating that reweighting must be done each time 
a cluster is deleted. 

APPENDIX 


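Proof of the Result $v_J(\hat Y_r) \approx v_{JL}(\hat Y_r)$ 


To establish the desired result, we first approximate the 
difference $\hat A_{(hi)}^{-1} - \hat A^{-1}$. Using the matrix identity 

$$(I + PQ)^{-1} = I - P(I + QP)^{-1}Q,$$ 

we get 

$$\begin{aligned}
\hat A_{(hi)}^{-1} - \hat A^{-1}
 &= [\hat A + (\hat A_{(hi)} - \hat A)]^{-1} - \hat A^{-1} \\
 &= \hat A^{-1}\bigl[I - (\hat A_{(hi)} - \hat A)\bigl(I + \hat A^{-1}(\hat A_{(hi)} - \hat A)\bigr)^{-1}\hat A^{-1}\bigr] - \hat A^{-1} \\
 &\approx -\hat A^{-1}(\hat A_{(hi)} - \hat A)\hat A^{-1}. \qquad (A.1)
\end{aligned}$$ 

The approximation (A.1) follows by noting that (i) 
$\hat A_{(hi)} - \hat A$ is of lower order than $\hat A$ under the assumption 
that no cluster contribution is of disproportionate size as the 
number of strata $L$ increases (see Yung (1996) for details 
on regularity conditions) and (ii) $[I + \hat A^{-1}(\hat A_{(hi)} - \hat A)]^{-1} \approx 
I - \hat A^{-1}(\hat A_{(hi)} - \hat A)$. 

Using (A.1), we obtain 

$$\begin{aligned}
\hat B_{(hi)} - \hat B
 &= \hat A_{(hi)}^{-1}\hat b_{(hi)} - \hat A^{-1}\hat b \\
 &\approx (\hat A_{(hi)}^{-1} - \hat A^{-1})\hat b + \hat A^{-1}(\hat b_{(hi)} - \hat b) \\
 &\approx -\hat A^{-1}(\hat A_{(hi)} - \hat A)\hat B + \hat A^{-1}(\hat b_{(hi)} - \hat b). \qquad (A.2)
\end{aligned}$$ 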
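
The first-order approximation (A.1) can be checked numerically. The sketch below is only an illustration under assumed inputs: the matrices are random stand-ins, not quantities from the survey, and the function name `inverse_perturbation` is hypothetical. It compares the exact change in the inverse with the linear term $-A^{-1} D A^{-1}$ when the perturbation $D$ is small relative to $A$.

```python
import numpy as np

def inverse_perturbation(A, D):
    """Return (exact, first_order) for the change in the inverse when A becomes A + D."""
    A_inv = np.linalg.inv(A)
    exact = np.linalg.inv(A + D) - A_inv   # left-hand side of (A.1)
    first_order = -A_inv @ D @ A_inv       # right-hand side of (A.1)
    return exact, first_order

rng = np.random.default_rng(0)
A = 5.0 * np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # well-conditioned stand-in
D = 1e-3 * rng.standard_normal((3, 3))                   # small perturbation
exact, first_order = inverse_perturbation(A, D)

# Relative size of the neglected (second-order) remainder.
rel_error = np.linalg.norm(exact - first_order) / np.linalg.norm(first_order)
```

The remainder is of second order in $A^{-1}D$, so `rel_error` shrinks roughly in proportion to the size of the perturbation.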


It now follows from (A.2) that 

[display equation (A.3), which approximates the change in $\hat Y_r$ when cluster $(hi)$ is deleted in terms of $e^*_{hi}$ and $\bar e^*_h$, is illegible in the source] 

where $e^*_{hi} = \sum_k (n_h w_{hik}) e^*_{hik}$ and $\bar e^*_h = (1/n_h) \sum_i e^*_{hi}$. We 
used the following results in arriving at (A.3): 

[two display equations, for $(\hat X_{(hi)} - \hat X)^T \hat B$ and $(X - \hat X)^T (\hat B_{(hi)} - \hat B)$, are illegible in the source] 

where $e_{hi} = \sum_k (n_h w_{hik}) e_{hik}$ and $u_{hi} = \sum_k (n_h w_{hik}) x_{hik} e_{hik}$. 

It now follows from (A.3) that 

$$v_J(\hat Y_r) \approx v(e^*_{hi}) = v_{JL}(\hat Y_r),$$ 

where the intermediate sum over the $L$ strata is illegible in the source. 

Small Area Estimation Under an Inverse Gaussian Model 


Y.P. CHAUBEY, F. NEBEBE and P.S. CHEN! 


ABSTRACT 


In this paper, we consider analysis of variance methodology for the inverse Gaussian distribution and adapt it for estimation 
of small area parameters in finite populations. It is demonstrated, through a Monte Carlo study, that these estimators 
offer a competitive choice for positively skewed survey data such as income or yield of a particular sector. 


KEY WORDS: Interactions; Inverse Gaussian; Monte Carlo; Regression estimates; Synthetic estimates; 
Sarndal-Hidiroglou estimator; Unbalanced model. 


1. INTRODUCTION 


Recently, a large number of methods have appeared in the 
literature on the problem of small area estimation; see, for 
example, Prasad and Rao (1990), Sarndal and Hidiroglou 
(1989), Choudhry and Rao (1988), and Sarndal (1984) and 
the references cited there, especially Sarndal and Råbäck 
(1983), Fay and Herriot (1979), Schaible (1979), Holt, 
Smith and Tomberlin (1979), and Gonzalez and Hoza 
(1978), to name a few. The need for small area estimates 
of several characteristics of a given population has gener- 
ated various useful procedures that produced realistic and 
sufficiently accurate estimates for local areas and other 
special subgroups. Several of the techniques suggested by 
the authors mentioned above were implicitly and/or 
explicitly model-based and utilized the standard normal 
theory. Others have tackled the provision of estimates for 
local areas from Bayesian and empirical Bayes perspectives 
by finding a compromise between the sample mean of an 
area (that is assumed to be normal) and an estimator based 
on regression on one or more covariates (see e.g., Stroud 
1987; MacGibbon and Tomberlin 1989). For an extensive 
review of recent developments in small area estimation, 
the reader may refer to Ghosh and Rao (1994). 

The standard normal theory analysis of factorial exper- 
iments may be inappropriate to apply in situations where 
data are generated from markedly positively skewed 
distributions. While most of the inference procedures are 
analytically tractable, the accuracy and reliability of the 
results may be questionable in many practical applications. 
Thus, such an analysis based on positively skewed distri- 
butions is called for. 

The objective of this paper is to consider inference 
procedures for unbalanced as well as balanced two-factor 
experiments under an inverse Gaussian model that may be used to 
produce estimates for small regions. Hidiroglou and Sarndal 
(1985) reported on a Monte Carlo study in which a modified 
regression estimator is preferred as a compromise between 
the synthetic estimator and the generalized regression 
estimator. Sarndal and Hidiroglou (1989) also presented 
further comparisons of estimators on the basis of 
conditional inference. The generalized regression estimator is 
basically derived from a super population regression 
model without any distributional assumptions. Chaubey 
(1991) considered the super population models of Durbin 
(1959) with gamma and inverse Gaussian auxiliary variables, 
in which case the generalized regression estimator has 
the property of being the best linear unbiased predictor 
(see Prasad and Rao 1990). In fact, the best linear unbiased 
predictor for the population total does not depend on the 
form of the distribution of the characteristic variable; 
hence this technique is preferable given that maximum 
likelihood estimates (MLE) may be hard to obtain. Since 
the super population distributions (as reflected in the 
populations) may closely resemble inverse Gaussian 
distributions for a variety of populations, we would like 
to exploit this aspect of the population. 
The use of the inverse Gaussian distribution is not merely 
a superficial one: it has been used successfully in many 
situations (see Folks and Chhikara 1978) and closely 
resembles the gamma, log normal and Weibull distributions, 
which are common in modeling positively skewed 
nonnegative random variables. In this paper, we study the use 
of the inverse Gaussian model in small area 
estimation. The approach of Fries and Bhattacharyya 
(1983), which discusses the analysis of two-factor 
experiments under an inverse Gaussian model, is of major 
importance. That paper gives estimation in the balanced, 
no-interaction model. We have extended this approach to the 
unbalanced case, which is essential for estimation of 
domain totals or means. In this respect the general multiple 
regression approach of Bhattacharyya and Fries (1986) 
and Whitmore (1983) may be adapted, but we have chosen 
to take the direct approach. In Section 2 we specify the 
model and present our proposed estimators under the 
inverse Gaussian model. In Section 3, a numerical study 
is carried out to evaluate the performance of the 
proposed estimator through Monte Carlo simulation. 
Finally, Section 4 presents a summary and conclusions. 


2. THE INVERSE GAUSSIAN REGRESSION 
MODEL FOR SMALL AREA 
ESTIMATION 


Suppose that a finite population $U$ is divided into $D$ 
non-overlapping domains $U_{d\cdot}$, $d = 1(1)D$, with $N_{d\cdot}$ as 
the size of $U_{d\cdot}$. The population is further divided along a 
second dimension into $G$ non-overlapping groups $U_{\cdot g}$, 
$g = 1(1)G$, with the size of $U_{\cdot g}$ denoted by $N_{\cdot g}$. The 
cross-classification of domains and groups gives rise to $DG$ 
population cells $U_{dg}$, $d = 1(1)D$, $g = 1(1)G$, with $N_{dg}$ 
as the size of $U_{dg}$. The population size $N$ can then be 
expressed as $N = \sum_d N_{d\cdot} = \sum_g N_{\cdot g} = \sum_d \sum_g N_{dg}$. Our 
interest lies in estimating domain totals $t_d = \sum_{U_{d\cdot}} y_k$, 
where $y$ represents the characteristic variable and $y_k$ is the 
observation on the $k$-th unit. A sample $s$ of size $n$ is selected 
from $U$ by simple random sampling. Denote by $s_{d\cdot}$, $s_{\cdot g}$ 
and $s_{dg}$ the parts of $s$ that happen to fall in $U_{d\cdot}$, $U_{\cdot g}$ and 
$U_{dg}$. The corresponding sample sizes are denoted by $n_{d\cdot}$, 
$n_{\cdot g}$ and $n_{dg}$, respectively. 


2.1 Regression Method for Inverse Gaussian Data 


We refer readers to two recent comprehensive reviews 
of developments concerning the inverse Gaussian 
distribution, namely, Chhikara and Folks (1989) and Iyengar 
and Patwardhan (1988). The probability density function 
of an inverse Gaussian variate with parameters $(\theta, \sigma)$, 
$IG(\theta, \sigma)$, is given by 

$$f(y; \theta, \sigma) = (2\pi\sigma)^{-1/2}\, y^{-3/2} \exp[-(2\sigma y)^{-1}(y\theta^{-1} - 1)^2], \qquad (2.1)$$ 

with $y > 0$, $\theta > 0$, $\sigma > 0$. The mean and variance of this 
distribution are $\theta$ and $\theta^3\sigma$, respectively. Bhattacharyya 
and Fries (1982) proposed a reciprocal linear model for $\theta$. 
Specifically, they assume a model of the form $\theta_k^{-1} = x_k^T \tau$. 
An estimator of $\tau$, similar to the estimator of the regression 
parameter in the usual linear model (see Sarndal 1984), is 
given in this situation by 

[display equation (2.2) for $\hat\tau$ is illegible in the source] 

This is called the pseudo maximum likelihood estimator, 
because it is obtained by unconditional maximization of 
the likelihood function, and therefore $x_k^T \hat\tau > 0$ may not 
be satisfied for all $k$. Then an estimator of the total $t_d$ of 
the $d$-th domain, in the spirit of Sarndal's (1984) modified 
regression estimator, may be constructed as 

$$\hat t_d = \sum_{k \in U_{d\cdot}} \hat y_k + \sum_{k \in s_{d\cdot}} \frac{e_k}{\pi_k}, \qquad (2.3)$$ 

where $\hat y_k = (x_k^T \hat\tau)^{-1}$ and $e_k = y_k - \hat y_k$. In what follows, we 
denote the mean of the $(d,g)$ cell by $\theta_{dg}$ and consider the 
case of simple random sampling, in which case the $\pi_k$'s are 
constant. We first discuss the prediction of observations for 
use in (2.3), based on an additive effects model given by 


$$\theta_{dg}^{-1} = \mu + \alpha_d + \beta_g, \qquad \sum_d \alpha_d = \sum_g \beta_g = 0, \qquad (2.4)$$ 

where $\mu$, the $\alpha_d$'s and the $\beta_g$'s represent the overall effect, the 
domain or row effects, and the group or column effects, 
respectively. For the inverse Gaussian distribution we must 
also have $\theta_{dg} > 0$ for all $(d,g)$ and $\sigma > 0$. Thus the 
parameters $\mu$, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_D)$, $\beta = (\beta_1, \beta_2, \ldots, \beta_G)$ 
and $\sigma$ lie in the set $\Omega = \{(\mu, \alpha, \beta, \sigma): \sum_d \alpha_d = 0, \sum_g \beta_g = 0; 
\mu + \alpha_d + \beta_g > 0, \forall (d,g); \sigma > 0\}$. Under this setup, 
estimation of parameters for prediction can be accomplished 
through unconditional maximization of the likelihood 
function. Conditional on the population and the 
sample sizes $n_{dg}$, and referring to (2.1) and (2.3), the 
log-likelihood function of the parameters is given by 

$$\ell = -\frac{1}{2}(\log\sigma) \sum_d \sum_g n_{dg} - (2\sigma)^{-1} \sum_d \sum_g \sum_k y_{dgk}^{-1}\,[y_{dgk}(\mu + \alpha_d + \beta_g) - 1]^2. \qquad (2.5)$$ 


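We first note that the parameters are effectively given by 
$(\mu, \alpha_1, \ldots, \alpha_{D-1}, \beta_1, \ldots, \beta_{G-1}, \sigma)$, where $\alpha_D = -\sum_{d=1}^{D-1} \alpha_d$ 
and $\beta_G = -\sum_{g=1}^{G-1} \beta_g$. 
Thus, differentiating the above with respect to $(\mu, \alpha_d, \beta_g)$, 
$d = 1, 2, \ldots, D-1$; $g = 1, 2, \ldots, G-1$, and equating 
the resulting partial derivatives to zero gives the following 
equations for the estimators $(\hat\mu, \hat\alpha_1, \ldots, \hat\alpha_{D-1}, 
\hat\beta_1, \ldots, \hat\beta_{G-1})$: 

$$\hat\mu\, y_{\cdot\cdot} + \sum_{d=1}^{D-1} \hat\alpha_d (y_{d\cdot} - y_{D\cdot}) + \sum_{g=1}^{G-1} \hat\beta_g (y_{\cdot g} - y_{\cdot G}) = n_{\cdot\cdot},$$ 

$$\hat\mu (y_{d\cdot} - y_{D\cdot}) + \hat\alpha_d\, y_{d\cdot} + \sum_{j=1}^{D-1} \hat\alpha_j\, y_{D\cdot} 
 + \sum_{g=1}^{G-1} \hat\beta_g [(y_{dg} - y_{Dg}) - (y_{dG} - y_{DG})] = n_{d\cdot} - n_{D\cdot},$$ 

$$\hat\mu (y_{\cdot g} - y_{\cdot G}) + \sum_{d=1}^{D-1} \hat\alpha_d [(y_{dg} - y_{Dg}) - (y_{dG} - y_{DG})] 
 + \hat\beta_g\, y_{\cdot g} + \sum_{j=1}^{G-1} \hat\beta_j\, y_{\cdot G} = n_{\cdot g} - n_{\cdot G}, \qquad (2.6)$$ 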

Survey Methodology, June 1996 


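where the totals and means are represented by the 
notations 

$$y_{dg} = \sum_k y_{dgk}, \quad y_{d\cdot} = \sum_g y_{dg}, \quad y_{\cdot g} = \sum_d y_{dg}, \qquad (2.7a)$$ 

$$n_{d\cdot} = \sum_g n_{dg}, \quad n_{\cdot g} = \sum_d n_{dg}, \quad n_{\cdot\cdot} = \sum_d \sum_g n_{dg}. \qquad (2.7b)$$ 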


The solutions $(\hat\mu, \hat\alpha_d, \hat\beta_g)$, $d = 1(1)D$, $g = 1(1)G$, 
provide the pseudo maximum likelihood estimator. They 
may not yield nonnegative response estimates, but they will 
coincide with the proper MLE as $n_{dg} \to \infty$ (see Fries and 
Bhattacharyya 1983) with probability one. Negative 
values of the response estimates may thus be truncated 
to zero. 
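
The pseudo maximum likelihood fit of the additive model can be sketched in matrix form. Setting the derivatives of (2.5) with respect to the mean parameters to zero gives the linear system $(\sum_k y_k x_k x_k^T)\,\delta = \sum_k x_k$, where $x_k$ is the sum-to-zero coded design row of unit $k$ and $\delta$ stacks $\mu$ with the free $\alpha$ and $\beta$ effects. The code below is a minimal illustration, not the authors' program; the function name `fit_additive_reciprocal` and the toy data are assumptions.

```python
import numpy as np

def fit_additive_reciprocal(y, d, g, D, G):
    """Solve (sum_k y_k x_k x_k^T) delta = sum_k x_k for the model 1/theta_dg = mu + alpha_d + beta_g.

    d and g hold zero-based domain and group labels; the last level of each factor
    is coded as minus the sum of the other effects (sum-to-zero constraints).
    """
    n = len(y)
    X = np.zeros((n, 1 + (D - 1) + (G - 1)))
    X[:, 0] = 1.0
    for k in range(n):
        if d[k] < D - 1:
            X[k, 1 + d[k]] = 1.0
        else:
            X[k, 1:D] = -1.0   # alpha_D = -(alpha_1 + ... + alpha_{D-1})
        if g[k] < G - 1:
            X[k, D + g[k]] = 1.0
        else:
            X[k, D:] = -1.0    # beta_G = -(beta_1 + ... + beta_{G-1})
    A = X.T @ (y[:, None] * X)
    b = X.sum(axis=0)
    delta = np.linalg.solve(A, b)
    mu = delta[0]
    alpha = np.append(delta[1:D], -delta[1:D].sum())
    beta = np.append(delta[D:], -delta[D:].sum())
    return mu, alpha, beta

# Toy check (assumed values): one observation per cell, each equal to its cell mean.
mu0 = 1.0
alpha0 = np.array([0.2, -0.1, -0.1])
beta0 = np.array([0.3, -0.3])
d = np.repeat(np.arange(3), 2)
g = np.tile(np.arange(2), 3)
y = 1.0 / (mu0 + alpha0[d] + beta0[g])
mu_hat, alpha_hat, beta_hat = fit_additive_reciprocal(y, d, g, D=3, G=2)
```

With noise-free data the solver returns the generating parameters exactly (up to rounding). With real data, a fitted $\hat\mu + \hat\alpha_d + \hat\beta_g$ that is not positive signals the case in which the text truncates the response estimate.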

In the case of the $IG(\theta, \sigma)$ model with interaction, the 
usual parameterization of the interaction effects suggests 
the model 

$$\theta_{dg}^{-1} = \mu + \alpha_d + \beta_g + \gamma_{dg}, \qquad 
\sum_d \alpha_d = \sum_g \beta_g = \sum_d \gamma_{dg} = \sum_g \gamma_{dg} = 0, \qquad (2.8)$$ 

where now $\gamma_{dg}$ is the interaction effect when the domain is at 
the $d$-th level and the group is at the $g$-th level. The estimators 
of the parameters may be obtained in this case following the 
method outlined above. However, noting that the maximum 
likelihood estimator (MLE) of $\theta_{dg}$ is $\bar y_{dg}$ and that there 
is a one-to-one relation between the parameters in the 
reparametrized model in terms of $(\mu, \alpha_d, \beta_g, \gamma_{dg})$ and the 
original parameters $\theta_{dg}$, explicit formulae for the MLE of 
the different parameters are not needed. Corresponding to 
equation (2.3), therefore, for a two-factor model with 
interaction, our estimator is 

$$\hat t_{dWI} = \sum_g N_{dg}\, \bar y_{dg}, \qquad (2.9)$$ 

which is the post stratified estimator and is not of further 
interest in small area estimation. For the model without 
interaction, the estimator is given as 

$$\hat t_{dWOI} = \sum_g N_{dg}\, \hat\theta_{dg} + \sum_g \hat N_{dg} (\bar y_{dg} - \hat\theta_{dg}), \qquad (2.10)$$ 

where $\hat\theta_{dg}^{-1} = \hat\mu + \hat\alpha_d + \hat\beta_g$, the estimators being 
obtained from (2.6), and $\hat N_{dg} = n_{dg} N / n_{\cdot\cdot}$. 

In order to judge the effectiveness of this estimator, a 
numerical study has been performed and is reported in the 
following section. 




3. A NUMERICAL STUDY OF THE INVERSE 
GAUSSIAN REGRESSION 
ESTIMATOR 


In this section we provide the results of a simulation 
study which evaluates the performance of the estimators 
developed in the previous section. The modified regression 
estimator due to Sarndal and Hidiroglou (1989), given 
below, will be used as the benchmark for this purpose: 

$$\hat t_{dS\text{-}H} = \sum_g N_{dg}\, \bar y_{\cdot g} + \sum_g F_d \hat N_{dg} (\bar y_{dg} - \bar y_{\cdot g}), \qquad (3.1)$$ 

where $F_d = N_{d\cdot}/\hat N_{d\cdot}$ if $\hat N_{d\cdot} \ge N_{d\cdot}$; otherwise $F_d = \hat N_{d\cdot}/N_{d\cdot}$. 
Here, $\hat N_{d\cdot} = n_{d\cdot} N / n_{\cdot\cdot}$. An alternative form of this 
estimator, which takes into account both group and domain 
effects, can be obtained by replacing $\bar y_{\cdot g}$ by $\bar y_{\cdot g} + \bar y_{d\cdot} - \bar y_{\cdot\cdot}$, 
but this has not been pursued here. It should be noted that 
the above estimators cannot be computed when $n_{dg}$ is 
zero. When this happens the estimators are simply taken 
to be the sample means of the respective domains. We also 
include the following modified version of $\hat t_{dWOI}$, 

$$\hat t_{dWOIM} = \sum_g N_{dg}\, \hat\theta_{dg} + \sum_g F_d \hat N_{dg} (\bar y_{dg} - \hat\theta_{dg}), \qquad (3.2)$$ 

for comparison. 
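
The three estimators compared in the study differ only in the synthetic part and in whether the correction term is damped by $F_d$. The sketch below computes them for one domain from cell-level summaries. It is an illustration under assumptions: every cell of the domain is taken to be non-empty, $F_d$ is read as the ratio of $N_{d\cdot}$ and $\hat N_{d\cdot}$ that does not exceed one, and the function name `domain_estimators` and all numbers are hypothetical.

```python
import numpy as np

def domain_estimators(N_dg, n_dg, ybar_dg, theta_hat_dg, ybar_g, N, n):
    """Return (t_SH, t_WOI, t_WOIM) for one domain, following (3.1), (2.10) and (3.2)."""
    N_dg = np.asarray(N_dg, dtype=float)
    n_dg = np.asarray(n_dg, dtype=float)
    ybar_dg = np.asarray(ybar_dg, dtype=float)
    theta_hat_dg = np.asarray(theta_hat_dg, dtype=float)
    ybar_g = np.asarray(ybar_g, dtype=float)

    N_hat_dg = n_dg * N / n                      # estimated cell sizes
    N_d, N_hat_d = N_dg.sum(), N_hat_dg.sum()    # true and estimated domain sizes
    # Damping factor: never larger than one (assumed reading of the definition).
    F_d = N_d / N_hat_d if N_hat_d >= N_d else N_hat_d / N_d

    corr_sh = np.sum(N_hat_dg * (ybar_dg - ybar_g))
    corr_model = np.sum(N_hat_dg * (ybar_dg - theta_hat_dg))
    t_sh = np.sum(N_dg * ybar_g) + F_d * corr_sh
    t_woi = np.sum(N_dg * theta_hat_dg) + corr_model
    t_woim = np.sum(N_dg * theta_hat_dg) + F_d * corr_model
    return t_sh, t_woi, t_woim

# Hypothetical inputs: two groups, population N = 1000, sample n = 100.
t_sh, t_woi, t_woim = domain_estimators(
    N_dg=[60, 40], n_dg=[8, 4], ybar_dg=[10.0, 21.0],
    theta_hat_dg=[9.0, 22.0], ybar_g=[11.0, 18.0], N=1000, n=100)
```

In this example the domain is over-represented in the sample ($\hat N_{d\cdot} = 120 > N_{d\cdot} = 100$), so $F_d = 5/6$ and the modified estimators shrink the correction term accordingly.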


3.1 Design of the Simulation Study 


We consider household income data for Canadians in 
1986, obtained from the Household Income, Facilities and 
Equipment microdata tape of Statistics Canada (1987), for 
generating the values of the parameters to be used in the 
simulation. Dividing the household incomes from these data 
into 10 provinces and 6 educational groups, we first 
fit the inverse Gaussian model given by equation (2.4). The 
estimates of the parameters are then used to form the true 
parameters of the inverse Gaussian super population model, 
which are summarized in Appendix A. The values of $D$, 
$G$ and $N_{dg}$ are chosen from this population (see Appendix B), 
where $D$ represents the number of provinces (i.e., $D = 10$) 
and $G$ represents the number of education groups (i.e., 
$G = 6$). Further sets of values of $\theta_{dg}$ and $\sigma$ are obtained by 
considering various combinations of $(c_1, c_2)$, $c_1 = 0(1)4$ 
and $c_2 = 1, .25, .1, .01$, where $c_1$ is used to transform $\theta_{dg}$ 
to $10^{-c_1}\theta_{dg}$ and $c_2$ is used to transform $\sigma$ to $c_2\sigma$. Note that 
$c_1 = 0$ and $c_2 = 1$ give the parameter values for the 
original population. Also, higher values of $c_1$ indicate 
smaller values of the means, and higher values of $c_2$ indicate 
higher values of the dispersion parameter. 
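
The effect of the pair $(c_1, c_2)$ on the shape of the distribution can be made concrete. Under (2.1) the variance is $\theta^3\sigma$, so the coefficient of variation is $(\theta\sigma)^{1/2}$. The sketch below applies the transformation to one cell and reports the coefficient of variation; the helper names are hypothetical and the numerical value of $\sigma$ is an assumed stand-in used only for illustration.

```python
import math

def transform(theta, sigma, c1, c2):
    """Apply the (c1, c2) transformation: theta -> 10**(-c1) * theta, sigma -> c2 * sigma."""
    return 10.0 ** (-c1) * theta, c2 * sigma

def coefficient_of_variation(theta, sigma):
    """CV of IG(theta, sigma): sqrt(theta**3 * sigma) / theta = sqrt(theta * sigma)."""
    return math.sqrt(theta * sigma)

theta0 = 22000.82   # one cell mean, taken from Appendix A
sigma0 = 2.5e-5     # assumed illustrative dispersion value

cv_original = coefficient_of_variation(*transform(theta0, sigma0, c1=0, c2=1))
cv_c1_plus_1 = coefficient_of_variation(*transform(theta0, sigma0, c1=1, c2=1))
cv_small_sigma = coefficient_of_variation(*transform(theta0, sigma0, c1=0, c2=0.01))
```

Raising $c_1$ by one divides the coefficient of variation by $\sqrt{10}$, and moving from $c_2 = 1$ to $c_2 = .01$ divides it by 10.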

For the simulation study, we first generate, for a given 
set of $\theta_{dg}$ and $\sigma$ values, an inverse Gaussian random 
sample using the algorithm in Michael et al. (1976), with 
the number of observations according to the values given in 
Appendix B. This random sample is then used as a finite 
population from which we select 1000 random samples for 
each of the sample fractions, 1% and 5%, with replacement. 
We had actually selected several random samples 
and obtained results similar to those reported here. From each 
sample we computed the estimators of totals for the 
10 domains using the estimators $\hat t_{dS\text{-}H}$, $\hat t_{dWOI}$ and $\hat t_{dWOIM}$. 
The criteria for evaluating the performance of the 
estimators are the mean absolute relative error (MARE) and 
the absolute relative bias (ARB), defined as follows: 

$$\mathrm{MARE}(\hat t_d) = \frac{1}{1000} \sum_{i=1}^{1000} |\hat t_{di} - t_d| / t_d, \qquad (3.3)$$ 

$$\mathrm{ARB}(\hat t_d) = \Bigl|\frac{1}{1000} \sum_{i=1}^{1000} (\hat t_{di} - t_d)\Bigr| / t_d. \qquad (3.4)$$ 

Here $\hat t_d$ denotes a typical estimator of $t_d$ and $\hat t_{di}$ denotes 
its value for the $i$-th Monte Carlo sample ($i = 1, \ldots, 1000$). 

Table 1 
Mean Absolute Relative Error (%) of Different Estimators 

Panel A: $c_2 = 1$ 

                 1% Sample                 5% Sample 
Domain      SH     WOI    WOIM        SH     WOI    WOIM 

$c_1 = 0$, $c_2 = 1$ 
  1        [?]     [?]   13.19      6.60    6.48    6.47 
  2        [?]     [?]   14.20       [?]    7.61    7.69 
  3        [?]     [?]   26.88     19.07   20.74   20.80 
  4        [?]     [?]   11.74      5.29    5.61    5.59 
  5        [?]     [?]   11.68      6.80    7.10     [?] 
  6       7.12    7.45     [?]      3.85    3.95     [?] 
  7        [?]     [?]   14.23       [?]    8.01    8.05 
  8      11.48   12.56   12.46      6.70     [?]    7.14 
  9       7.43    7.92     [?]      3.61    3.74     [?] 
 10        [?]     [?]   17.16       [?]     [?]   11.80 

$c_1 = 2$, $c_2 = 1$ 
  1       3.34    2.18     [?]      1.66    0.79    0.78 
  2       4.14    3.94    3.82      2.14    1.07    1.06 
  3       2.44    1.67    1.65       [?]    0.71    0.70 
  4       2.05    1.70    1.69      0.98    0.70    0.70 
  5       1.08     [?]    1.16      0.50    0.51    0.51 
  6       1.74    1.14    1.14      0.78    0.52    0.52 
  7       1.90     [?]    1.56      0.91    0.72    0.72 
  8       1.48    1.38    1.38      0.70    0.60    0.60 
  9       1.41    1.30    1.29      0.67    0.59    0.58 
 10        [?]    1.38    1.38      0.56    0.59    0.59 

$c_1 = 4$, $c_2 = 1$ 
  1       2.99    1.48    1.44      1.47    0.08    0.08 
  2       3.54    3.37     [?]      1.86    0.14    0.13 
  3       1.81    0.45    0.44      0.87    0.07    0.07 
  4        [?]    0.36    0.35      0.66    0.07    0.07 
  5       0.27    0.13    0.13      0.11    0.05    0.05 
  6       1.29    0.13    0.13      0.55    0.05    0.05 
  7        [?]    0.31    0.31      0.56    0.07    0.07 
  8       0.81    0.18    0.18      0.38    0.06    0.06 
  9       0.69    0.14    0.14      0.30    0.06    0.06 
 10       0.26    0.15    0.15      0.10    0.06    0.06 

Panel B: $c_2 = .01$ 

                 1% Sample                 5% Sample 
Domain      SH     WOI    WOIM        SH     WOI    WOIM 

$c_1 = 0$, $c_2 = .01$ 
  1        [?]    2.46    2.45      1.80    0.89    0.89 
  2       3.79    3.56    3.48      2.10    0.59    0.60 
  3        [?]     [?]     [?]      1.19    0.77     [?] 
  4       1.83    1.08    1.09      0.93    0.58    0.58 
  5       0.92    0.90    0.91      0.42    0.40    0.40 
  6       1.94    1.22     [?]      0.93    0.64    0.64 
  7        [?]    1.13    1.14      0.86    0.64    0.64 
  8       1.29    0.93    0.94      0.76    0.67    0.68 
  9       3.47    2.99    2.96       [?]    2.97    2.96 
 10       0.93    0.94    0.95      0.52    0.52    0.53 

$c_1 = 2$, $c_2 = .01$ 
  1       2.99    1.48    1.44      1.47    0.08    0.08 
  2       0.54     [?]     [?]      1.86    0.14    0.13 
  3       1.81    0.45    0.44      0.87    0.07    0.07 
  4       1.32    0.36    0.35      0.66    0.07    0.07 
  5       0.27    0.13     [?]      0.11    0.05    0.05 
  6       1.29    0.13    0.13      0.55    0.05    0.05 
  7        [?]    0.31    0.31      0.56    0.07    0.07 
  8       0.81    0.18    0.18      0.38    0.06    0.06 
  9       0.69    0.14    0.14      0.30    0.06    0.06 
 10       0.26    0.15    0.15      0.10    0.06    0.06 

$c_1 = 4$, $c_2 = .01$ 
  1       2.99    1.45    1.41      1.47    0.01    0.01 
  2       3.54    3.36     [?]      1.87    0.05    0.05 
  3       1.80    0.38    0.37      0.86    0.01    0.01 
  4        [?]    0.28    0.27      0.66    0.01    0.01 
  5       0.24    0.06    0.06      0.10    0.01    0.01 
  6       1.29    0.06    0.06      0.54    0.01    0.01 
  7       1.20    0.24    0.24      0.55    0.01    0.01 
  8       0.79    0.09    0.09      0.37    0.01    0.01 
  9       0.68    0.06    0.06      0.29    0.01    0.01 
 10       0.23    0.07    0.07      0.09    0.01    0.01 

[?] = entry illegible in the source. 
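
The two building blocks of the simulation, drawing inverse Gaussian variates and scoring an estimator by (3.3) and (3.4), can be sketched as follows. The generator uses the transformation-with-multiple-roots method usually attributed to Michael, Schucany and Haas (1976), written here for the $(\theta, \sigma)$ parameterization of (2.1) with shape $\lambda = 1/\sigma$. This is an assumed reimplementation, not the authors' code, and the function names and test values are hypothetical.

```python
import numpy as np

def rinvgauss(theta, sigma, size, rng):
    """Draw IG(theta, sigma) variates: mean theta, variance theta**3 * sigma."""
    lam = 1.0 / sigma
    nu = rng.standard_normal(size) ** 2                  # chi-square(1) draws
    x = (theta + theta**2 * nu / (2.0 * lam)
         - theta / (2.0 * lam) * np.sqrt(4.0 * theta * lam * nu + theta**2 * nu**2))
    u = rng.uniform(size=size)
    # Choose between the two roots with probability theta / (theta + x).
    return np.where(u <= theta / (theta + x), x, theta**2 / x)

def mare(estimates, t):
    """Mean absolute relative error, as in (3.3), as a proportion."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(estimates - t)) / t)

def arb(estimates, t):
    """Absolute relative bias, as in (3.4), as a proportion."""
    estimates = np.asarray(estimates, dtype=float)
    return float(abs(np.mean(estimates - t)) / t)

rng = np.random.default_rng(12345)
draws = rinvgauss(theta=2.0, sigma=0.5, size=200_000, rng=rng)
sample_mean, sample_var = draws.mean(), draws.var()

toy_mare = mare([9.0, 11.0, 10.0, 12.0], t=10.0)
toy_arb = arb([9.0, 11.0, 10.0, 12.0], t=10.0)
```

Multiplying the returned proportions by 100 gives percentages on the scale of Tables 1 and 2. Because positive and negative errors cancel in (3.4) but not in (3.3), the ARB can never exceed the MARE for the same set of estimates.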
Table 2 
Absolute Relative Bias (%) of Different Estimators 

              1% Sample            5% Sample            1% Sample            5% Sample 
Domain    SH    WOI   WOIM     SH    WOI   WOIM     SH    WOI   WOIM     SH    WOI   WOIM 

          $c_1 = 0$, $c_2 = 1$                          $c_1 = 0$, $c_2 = .01$ 
  1     4.34   2.40    [?]   1.87   0.26   0.27   2.66   1.58   1.54    [?]   0.03   0.03 
  2     8.88   3.46   4.39   2.18   0.30   0.23    [?]   3.40   3.31   1.38   0.04   0.04 
  3     8.13   3.47   2.74   0.51    [?]    [?]   1.44   0.31   0.32   0.68   0.01   0.01 
  4     1.57   0.51   0.53   0.50   0.21   0.22    [?]   0.29   0.30   0.53   0.03   0.03 
  5     0.13   0.33   0.35   0.20   0.16   0.18   0.10   0.03   0.02   0.05   0.01   0.01 
  6     1.09   0.14   0.04   0.02   0.39   0.42   1.09   0.03   0.03   0.43   0.02   0.01 
  7     1.20   1.09   1.59   0.54   0.28   0.30   0.99   0.22   0.23   0.43   0.01   0.01 
  8     0.40   0.04    [?]   0.20   0.53   0.54   0.55   0.00   0.01   0.28   0.03   0.03 
  9     1.03   0.47   0.36   0.24   0.04   0.01   1.01   0.35   0.37   0.45   0.14   0.14 
 10     1.05    [?]   2.03   0.04   0.30   0.29   0.08   0.02   0.01   0.06   0.01   0.01 

          $c_1 = 2$, $c_2 = 1$                          $c_1 = 2$, $c_2 = .01$ 
  1     2.40   1.37   1.33    [?]   0.01   0.01   2.47   1.43   1.39    [?]   0.01   0.01 
  2     3.00   3.28   3.16    [?]   0.02   0.01   3.06   3.34   3.24   1.36   0.03   0.03 
  3      [?]   0.39   0.38   0.70   0.04   0.04   1.46   0.35   0.34   0.65   0.01   0.01 
  4     1.00   0.25   0.25   0.53   0.04   0.04   1.01   0.23   0.23   0.49   0.00   0.00 
  5     0.10   0.02   0.03   0.04   0.00   0.01   0.10   0.01   0.02   0.04   0.00   0.00 
  6     1.16   0.01   0.01   0.47   0.02   0.02    [?]   0.01   0.00   0.46   0.00   0.00 
  7     1.00   0.27   0.27   0.42   0.00   0.00   0.95   0.21   0.21   0.41   0.00   0.00 
  8     0.48   0.04   0.04   0.25   0.01   0.01   0.57   0.04   0.04   0.26   0.00   0.00 
  9     0.64   0.06   0.05   0.27   0.02   0.02   0.61   0.01   0.00   0.26   0.00   0.00 
 10     0.01   0.02   0.02   0.02   0.00   0.00   0.06   0.01   0.01   0.03   0.00   0.00 

          $c_1 = 4$, $c_2 = 1$                          $c_1 = 4$, $c_2 = .01$ 
  1     2.47   1.43   1.39    [?]   0.01   0.01   2.48   1.43   1.39   1.15   0.00   0.00 
  2     3.06   3.34   3.24   1.36   0.03   0.03   3.07   3.35   3.24   1.36   0.04   0.04 
  3     1.46   0.35   0.34   0.65   0.01   0.01   1.45   0.34   0.34   0.64   0.00   0.00 
  4     1.01   0.23   0.23   0.49   0.00   0.00   1.01   0.24   0.24   0.49   0.00   0.00 
  5     0.10   0.01   0.02   0.04   0.00   0.00   0.11   0.01   0.02   0.04   0.00   0.00 
  6      [?]   0.01   0.00   0.46   0.00   0.00    [?]   0.01   0.00   0.46   0.00   0.00 
  7     0.95   0.21   0.21   0.41   0.00   0.00   0.94   0.20   0.20   0.41   0.00   0.00 
  8     0.57   0.04   0.04   0.26   0.00   0.00   0.58   0.04   0.05   0.26   0.00   0.00 
  9     0.61   0.01   0.00   0.26   0.00   0.00   0.60   0.00   0.00   0.25   0.00   0.00 
 10     0.06   0.01   0.01   0.03   0.00   0.00   0.06   0.01   0.01   0.03   0.00   0.00 

[?] = entry illegible in the source. 


3.2 Analysis of Results 


The MARE values computed according to (3.3) and the 
ARB values from (3.4) for the three estimators and for 
different sample sizes are reported in Tables 1 and 2, 
respectively, for a selection of pairs $(c_1, c_2)$. The values 
of $c_1$ are chosen to represent large means (as in the 
original population, $c_1 = 0$), moderate means ($c_1 = 2$) 
and small means ($c_1 = 4$), whereas the values chosen 
for $c_2$ represent the original dispersion parameter 
($c_2 = 1$) and a much smaller value ($c_2 = .01$). It may 
be interesting to note that increasing $c_1$ by 1 while keeping 
$c_2$ fixed reduces the coefficient of variation by a factor 
of $\sqrt{10}$, since the coefficient of variation under (2.1) 
is $(\theta\sigma)^{1/2}$. 

Some of the MARE and ARB values reported in 
Tables 1 and 2 are also plotted for visual inspection in 
Figures 1 and 2, respectively, for 1% samples. 

When comparing the MARE and ARB values, reductions 
in biases as well as in relative errors are observed in 
many cases for both 1% and 5% samples. It is found that 
the MARE and ARB values decrease with decreasing 
values of the mean and the dispersion parameter $\sigma$. Reductions 
are substantial, especially in the case of the 5% sample and/or 
when means are small. Note also that the reductions in bias 
are generally larger than the reductions in the errors. We may 
note from Johnson and Kotz (1970, p. 141) that, for a fixed 
value of the mean, the standardized inverse Gaussian 
distribution tends to the unit normal as the coefficient of 
variation tends to zero. Since larger gains in MARE and 
ARB values are noted for small values of the coefficient 
of variation, we conclude that proper modeling of the 
mean is important when the coefficient of variation is 
small for model based estimation. 

[Figure 1: plots of MARE (%) against province. Panels: a. $c_1 = 0$, $c_2 = 1$; b. $c_1 = 0$, $c_2 = 0.01$; remaining panel titles partly illegible in the source. Plotting symbols: *: S-H (3.1); o: WOI (2.10); +: WOIM (3.2).] 

Figure 1. Mean absolute relative errors for different estimators for 1% sample. 

We further find that $\hat t_{dWOI}$ and $\hat t_{dWOIM}$ have almost the same 
MARE and ARB, which indicates that the modification 
of the estimator in (2.10) is not necessary. It may be 
remarked that the estimator $\hat t_{dS\text{-}H}$, in contrast, has been 
demonstrated (see Hidiroglou and Sarndal 1985) to be a 
substantial improvement over the corresponding 
unmodified estimator due to Sarndal (1984). 

Since $\hat t_{dWOI}$ and $\hat t_{dWOIM}$ may be criticized as being 
model dependent, we defend them on the following 
grounds. The inverse Gaussian distribution offers a variety 
of shapes and may be able to approximate lognormal, 
gamma, Weibull and other such positively skewed shapes. 
If we suspect that the principal characteristic is positively 
skewed, then the methodology we discussed here is viable 
and useful. 

[Figure 2: plots of ARB against province. Panels: a. $c_1 = 0$, $c_2 = 1$; b. $c_1 = 0$, $c_2 = 0.01$. Plotting symbols: *: S-H (3.1); o: WOI (2.10); +: WOIM (3.2).] 

Figure 2. Absolute relative biases for different estimators for 1% sample. 


4. SUMMARY AND CONCLUSIONS 


The generalization of analysis of variance methodology 
for inverse Gaussian populations to unbalanced designs 
was considered. The models without interactions of 
factors were studied and applied to the problem of 
estimation of small area parameters in finite populations. 
Using Canadian survey data, synthetic populations were 
generated in a Monte Carlo study. Through this we 
demonstrated that the proposed estimators perform well 
under a variety of conditions when the population can 
be regarded as a random sample from some inverse 
Gaussian distribution. This approach offers a competitive 
choice for estimation of parameters in positively skewed 
survey data. 

APPENDIX A 
Values of the Parameters for Generation of the IG Population 

$\mu = 3.13241147 \times 10^{-5}$, $\sigma = 2.5447984 \times 10^{-[?]}$ 

$\alpha_d$ values ($\times 10^{-6}$), $d = 1, \ldots, 10$: 
   3.1902855    2.8235779    1.5676078    .8056079    -.95350458 
  -4.0661125    .49944356    .0061694263  -2.7414128  -1.1316622 

$\beta_g$ values ($\times 10^{-5}$), $g = 1, \ldots, 6$: 
   1.0938451    .36781639   -.012707035   -.11561414   -.30936835   -1.023972 

$\theta_{dg}$ values: 

  d        g = 1       g = 2       g = 3       g = 4       g = 5       g = 6 
  1    22,000.82   26,183.11   29,080.48         [?]   31,826.13   41,195.19 
  2    22,179.76   26,436.94   29,393.94   30,310.79   32,201.96   41,827.05 
  3          [?]   27,344.90   30,520.70   31,510.37         [?]   44,146.20 
  4    23,219.00   27,926.81   31,247.41   32,285.58   34,439.96   45,682.95 
  5    24,207.76   29,369.63   33,064.91   34,229.61   36,661.02   49,674.90 
  6    26,180.44   32,324.63   36,858.30   38,311.45   41,383.33   58,760.34 
  7    23,385.24   28,167.65   31,549.24   32,607.90   34,806.97   46,330.96 
  8    23,658.15   28,564.53   32,047.98   33,140.96   35,415.03   47,414.57 
  9    25,302.90         [?]   35,142.43   36,461.01   39,232.58   54,516.76 
 10    24,312.62   29,524.12   33,260.85   34,439.64   36,902.04   50,118.45 

[?] = entry illegible in the source. 


APPENDIX B 
Values of the Cell Sizes $N_{dg}$ 

[The table of cell sizes is not recoverable from the source.] 


A Comparison of Some Weighting Adjustment Methods 
for Panel Nonresponse 


LOU RIZZO, GRAHAM KALTON and J. MICHAEL BRICK! 


ABSTRACT 


In some surveys, many auxiliary variables are available for respondents and nonrespondents for use in nonresponse 
adjustment. One decision that arises is how to select which of the auxiliary variables should be used for this purpose 
and another decision involves how the selected variables should be used. Several approaches to forming weighting 
adjustments for nonresponse are considered in this research. The methods include those based on logistic regression 
models, categorical search algorithms, and generalized raking. These methods are applied to adjust for panel 
nonresponse in the Survey of Income and Program Participation (SIPP). The estimates from the alternative 
adjustments are assessed by comparing them to one another and to benchmark estimates from other sources. 


KEY WORDS: Nonresponse bias; Panel surveys; Generalized raking; Benchmark estimates. 


1. INTRODUCTION 


Weights are commonly used in the analysis of survey 
data to compensate for unequal selection probabilities of 
the sampled elements, to compensate for unit nonresponse, 
and to make the weighted sample distributions for certain 
variables conform to known population distributions for 
those variables (thereby aiming to compensate for non- 
coverage and to improve the precision of the survey 
estimates) (Kish 1992). Corresponding to these three objec- 
tives, the weights are usually developed in three stages. 
First, a base weight is calculated for each sampled element 
as the inverse of the element’s selection probability. 
Second, the base weights of responding sampling elements 
are multiplied by a nonresponse weight to compensate for 
the nonrespondents. Third, the adjusted weight is modified 
to make the weighted sample distributions for certain 
variables conform to external information on these 
distributions. 

This paper deals with the nonresponse adjustment 
weights that attempt to compensate for unit nonresponse. 
A commonly used procedure for obtaining these weights 
is to divide the total sample into a set of weighting classes 
based on information known for both respondents and 
nonrespondents, and then to increase the base weights for 
the respondents in a weighting class to represent the non- 
respondents in that class (Oh and Scheuren 1983; Kalton 
1983). In many surveys little information is known about 
the nonrespondents, beyond the primary sampling units 
and strata from which they come. In this case, the choice 
of possible weighting classes is limited, and the procedure 
can be applied fairly straightforwardly. 
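
The weighting class procedure described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (every class contains at least one respondent, and the class labels are known for all sampled elements); the function name, record layout and toy data are hypothetical and are not taken from any survey's production system.

```python
from collections import defaultdict

def weighting_class_adjust(records):
    """records: list of dicts with keys 'cls', 'base_weight', 'responded'.
    Returns adjusted weights in input order (0.0 for nonrespondents).
    Assumes every class has at least one respondent."""
    total = defaultdict(float)  # weighted count of all sampled elements, by class
    resp = defaultdict(float)   # weighted count of respondents, by class
    for r in records:
        total[r["cls"]] += r["base_weight"]
        if r["responded"]:
            resp[r["cls"]] += r["base_weight"]
    adjusted = []
    for r in records:
        if r["responded"]:
            # Inflate the base weight by the inverse weighted response rate of the class.
            adjusted.append(r["base_weight"] * total[r["cls"]] / resp[r["cls"]])
        else:
            adjusted.append(0.0)
    return adjusted

# Hypothetical toy sample with two weighting classes.
sample = [
    {"cls": "A", "base_weight": 100.0, "responded": True},
    {"cls": "A", "base_weight": 100.0, "responded": False},
    {"cls": "B", "base_weight": 50.0, "responded": True},
    {"cls": "B", "base_weight": 50.0, "responded": True},
]
weights = weighting_class_adjust(sample)
```

In this toy sample the single respondent in class A carries the weight of the nonrespondent in that class, so the adjusted weights still sum to the total base weight of the sample.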

In some surveys, however, there is an extensive amount 
of information available for the nonrespondents. This 
information may be available from the sampling frame
(e.g., when sampling employees from personnel files) or
by matching sampled elements with administrative records. 
Also, in panel surveys and other surveys involving more 
than one stage of data collection, extensive information 
on nonrespondents at later stages is available from their 
responses at the early stages. 

The major focus of this research is on methods for 
developing weighting adjustments for nonresponse when 
a large number of characteristics of the nonrespondents 
are known. In this situation, decisions about methods of 
adjusting for nonresponse involve selecting which aux- 
iliary variables will be used and how they will be used to 
make the adjustments. 

The main ideas are presented in this article by applying 
several different adjustment procedures in a specific panel 
survey, the Survey of Income and Program Participation 
(SIPP). The SIPP is an ongoing household panel survey 
conducted by the U.S. Bureau of the Census. The non- 
respondents to a SIPP panel can be separated into two
groups: those who fail to respond at the initial wave of 
data collection (initial wave nonrespondents), and those 
who respond at the initial wave but fail to respond at one 
or more of the subsequent waves of the panel for which 
they are eligible (panel nonrespondents). For the latter 
group, extensive information from the initial wave of data 
collection can be utilized in adjusting for panel non- 
response. The weighting adjustments studied here relate 
to the panel nonrespondents only. These adjustments 
modify the weights of panel respondents (i.e., those who 
provide data for all waves for which they are eligible) to 
compensate for the panel nonrespondents. 

In the SIPP, a national probability sample of house- 
holds is interviewed each year, and all the adults aged 15 
and over living in those households at the initial wave 
become panel members who are followed for the duration
of the panel. Until now SIPP panels have had a lifetime
of 2% years, but this is being increased with the 1996 panel 
to 4 years. Interviews are conducted with panel members at 
four-month intervals to collect data about income amounts 
received, participation in income maintenance programs, 
and other factors that may affect their income and economic 
welfare. Data are also collected about children. See Nelson, 
McMillen and Kasprzyk (1985) and Jabine, King and Petroni 
(1990) for further information on the SIPP design. 

The investigation reported here was conducted with the 
1987 SIPP panel, using the panel’s public use data file. 
That panel started with a sample of about 12,300 house- 
holds and followed panel members for seven waves of data 
collection. The household nonresponse rate at the initial 
wave was 6.7 percent (Jabine et al. 1990). Including 
children, 30,841 individuals were living in the responding 
households at the initial wave. Of these individuals, 
20.8 percent failed to provide data for all waves for which 
they were eligible, i.e., they were panel nonrespondents. 

In addition to selecting auxiliary variables and studying 
alternative methods of using those variables to form 
weighting adjustments for panel nonresponse, this research 
includes a comparative evaluation of the procedures. The 
evaluation is performed by comparing a range of estimates 
produced with the alternative methodologies with one 
another and with benchmark estimates. The final section 
of this article summarizes the results and draws conclusions 
about the effectiveness of the alternative weighting schemes 
investigated. Further details are given by Rizzo, Kalton, 
and Brick (1994). 


2. PREDICTORS OF RESPONSE 
PROPENSITY 


The first step in developing panel nonresponse adjust- 
ments is deciding which of the large number of items 
available from the first wave of data collection should be 
selected for use in the adjustment procedures. That selection 
is the focus of this section. The approach adopted is to 
choose items with responses that discriminate persons by 
their likelihood to respond at all later waves. Little (1986) 
calls this method a response propensity stratification 
method and shows that the large sample bias of estimates 
can be reduced by adjusting the base weight by the inverse 
of the probability that an element responds. 

In the 1987 SIPP panel, there were 58 items available 
from the initial wave of data collection (Wave 1) that could 
be used as potential explanatory variables for panel non- 
response. All of the items used currently by the Bureau of 
the Census for the SIPP panel nonresponse adjustment 
were part of this set of 58, with the exception of the 
Metropolitan Statistical Area (MSA) status, which was 
suppressed from the public use data file because of disclo- 
sure concerns. 


With panel response status (panel respondent vs. panel 
nonrespondent) as the dependent variable, logistic regres- 
sion analysis was viewed as a natural method for selecting 
a model for panel nonresponse. However, before attempt- 
ing this modeling, an initial screening of the variables was 
performed to reduce the large number of variables to a 
more manageable set. As a general guideline, items were 
retained for the logistic regression analysis if the difference 
in response rates between any two categories for the item 
was both statistically significant and at least four percent- 
age points. For a variety of reasons, some items were 
retained even if they did not meet these requirements. For 
example, the difference in the panel response rates for 
males and females was less than 2 percent, but gender was 
nevertheless used in some subsequent analyses. 
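
The screening guideline can be illustrated with a short Python sketch. The text does not say which significance test was used, nor whether "any two categories" means at least one pair; the sketch assumes an unweighted pooled two-proportion z-test at the 5 percent level and retains an item if at least one pair of categories meets both conditions. The function names, the significance level and the counts are hypothetical.

```python
import math
from itertools import combinations

def two_prop_z_pvalue(r1, n1, r2, n2):
    """Two-sided p-value for equal response rates (pooled z-test, unweighted)."""
    pooled = (r1 + r2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (r1 / n1 - r2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def passes_screen(categories, min_diff=0.04, alpha=0.05):
    """categories: {label: (respondents, sample size)}.
    True if some pair of categories differs by at least min_diff
    and the difference is significant at level alpha (assumed rule)."""
    for (_, (r1, n1)), (_, (r2, n2)) in combinations(categories.items(), 2):
        diff = abs(r1 / n1 - r2 / n2)
        if diff >= min_diff and two_prop_z_pvalue(r1, n1, r2, n2) < alpha:
            return True
    return False

# Hypothetical counts: a 15-point gap for tenure, a 1-point gap for gender.
tenure = {"owner": (850, 1000), "renter": (700, 1000)}
gender = {"male": (790, 1000), "female": (800, 1000)}
keep_tenure = passes_screen(tenure)
keep_gender = passes_screen(gender)
```

A design-based test using the survey weights would be more appropriate in practice; the unweighted test is used here only to keep the example self-contained.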

The screening process reduced the number of items for 
the logistic regression analysis from 58 to 31. The items 
retained were: tenure, public housing, household type, 
Census region, household education, household size, 
household income, whether householder holds financial 
instruments (bonds), gender, race, Hispanic origin, rela- 
tionship to reference person (RRP), age, marital status, 
family type, education, student status, whether laid off 
work, personal income, whether holds multiple jobs, 
class of work, whether a recipient of Medicare benefits,
Medicaid, Women, Infants, and Children (WIC), Aid to 
Families with Dependent Children (AFDC), food stamps, 
general assistance, Social Security, other welfare, Veteran’s 
status, and the number of imputed items at Wave 1. 

The last item, the number of imputed items, was 
included as an index of cooperation at Wave 1. Other 
studies have found that individuals who are less coopera- 
tive at the initial wave of a panel survey are more likely 
to be nonrespondents at later waves (see, for example, 
Kalton, Lepkowski, Montanari and Maligalig 1990). As 
described below, this index turned out to be highly related 
to panel nonresponse. 


2.1 Logistic Regression Analysis 


Since all 31 items identified in the screening analysis 
were at least marginally correlated with panel nonresponse, 
they are all candidate variables for use in a weighting 
adjustment scheme to reduce the panel nonresponse bias 
in the survey estimates. However, the screening analysis 
was limited because it did not consider the interrelation- 
ships between the items and it retained too many variables 
for practical use in making the panel nonresponse adjust- 
ments. For example, two items that are highly associated 
with response status might also be highly correlated with 
each other, so that the use of one of the two might be 
sufficient in making the adjustments. To address this issue, 
the next step in selecting predictors of panel nonresponse 
was to investigate which combinations of the items could 
best predict panel response status.

Table 1
Parameter Estimates for the Logistic Regression Model

Predictors                                  Parameter Estimate

Intercept                                   -0.465
Age (χ² = 184.9, p-value < .0001)
    < 16                                    -0.179
    16-24                                    0.446
    25-50                                    0.187
    51-71                                   -0.056
    > 71                                     0.0
Race (χ² = 214.0, p-value < .0001)
    White                                   -0.351
    Black                                    0.255
    Other                                    0.0
RRP (χ² = 69.0, p-value < .0001)
    Family member                           -0.251
    Nonfamily member                         0.0
Census region (χ² = 327.3, p-value < .0001)
    New England                              0.009
    Mid Atlantic                             0.167
    South Atlantic                           0.027
    East South Central                      -0.231
    North Central                           -0.396
    Mountain/West South Central              0.425
    Pacific                                  0.0
Tenure (χ² = 207.2, p-value < .0001)
    Home owner                              -0.154
    Renter                                   0.331
    Other                                    0.0
Items imputed (χ² = 434.2, p-value < .0001)
    0                                       -0.626
    1                                       -0.244
    2 to 3                                   0.296
    > 3                                      0.0
Bond status (χ² = 97.1, p-value < .0001)
    No bonds                                 0.168
    Some bonds                               0.0
Layoff (χ² = 33.4, p-value < .0001)
    Not laid off                            -0.179
    Laid off                                 0.0
Food stamps (χ² = 39.3, p-value < .0001)
    Not recipient                           -0.191
    Recipient                                0.0
Class of work (χ² = 31.4, p-value < .0001)
    Business                                 0.100
    Other                                    0.103
    Government                               0.0
Education (χ² = 12.8, p-value = .0003)
    Last grade tenth or eleventh            -0.075
    Other                                    0.0
Household income (χ² = 14.9, p-value = .0006)
    Less than $1,200/month                   0.117
    $1,200-$8,000/month                     -0.088
    Greater than $8,000/month                0.0
Gender (χ² = 10.3, p-value = .0013)
    Male                                     0.047
    Female                                   0.0
RRP-Age < 16 Interaction (χ² = 10.1, p-value = .0015)
    Family member, child                     0.096
    Other                                    0.0

A logistic regression approach was used to examine
the joint relationships of several items with panel response
status. The regression models were fitted using the Wave 1 
survey weights that accounted for unequal selection prob- 
abilities and initial wave nonresponse. After examining a 
number of possible models, a model with thirteen main- 
effect variables and one interaction term was selected as 
a reasonable representation of the data. 

Table 1 presents the parameter estimates for each level 
of each predictor variable in this model, together with 
Wald (χ²) statistics for each predictor variable. The
parameter value of the last level of each predictor variable 
(the benchmark level) is set to zero. The parameter esti- 
mates for the remaining levels of each predictor variable 
represent differences in response propensity from the 
benchmark level. As can be seen from the Wald statistics, 
all the predictor variables make highly significant contri- 
butions to the model. 

A notable feature of this model is that it contains only 
one interaction term, the relationship to reference person/ 
age under 16 interaction. All other interactions investigated 
had smaller χ² values than this one. Even the relationship
to reference person/age under 16 interaction has a rela- 
tively low predictive power. In fact, this interaction and 
the last three predictor variables in Table 1 (education, 
household income, and gender) were not included in most 
of the weighting procedures discussed below because of 
their limited predictive power for panel response status. 
The weighting procedures are mostly based on a reduced 
main-effects model comprising the first ten predictor 
variables listed in Table 1. 


3. ALTERNATIVE WEIGHT 
ADJUSTMENTS 


The method used in the SIPP to adjust the weights for 
panel nonresponse is described by Chapman, Bailey, and 
Kasprzyk (1986). The method basically consists of forming 
nonresponse adjustment cells and then adjusting the 
weights by the inverses of the response rates in the cells. 
The cells are formed by the cross-classification of the 
responses from a set of Wave 1 variables thought to be
correlated with panel response. Small cells are combined
so that the resulting sample size in each collapsed cell is
30 or more. The reciprocal of the observed (weighted)
response rate in each collapsed cell is the panel non-
response adjustment for that cell. The panel nonresponse
adjustment is then multiplied by the Wave 1 weight to
create a nonresponse adjusted weight. The Wave 1 weight
includes an adjustment for Wave 1 nonresponse, but it
does not include the Wave 1 poststratification adjustment.

This section examines alternative methods for performing 
the panel nonresponse adjustments. These methods can 
be categorized into three groups:

• Logistic regression methods.
• CHAID methods.
• Generalized raking methods.


Each of the alternative approaches to nonresponse 
adjustment is discussed below. The procedures for 
developing the weighting adjustments are detailed along 
with important statistical properties of the adjustments. 


3.1 Adjustments Based on Logistic Models 


The first set of weighting adjustments we discuss is 
developed directly from the logistic regression model 
described in the previous section. This panel nonresponse 
weighting adjustment, called the predicted logistic adjust- 
ment, was computed by taking the inverses of the response 
rates predicted from the reduced main-effects logistic 
regression model for each of the cells in the crossclassifica- 
tion of the ten predictor variables in that model. 

Since the parameters for computing the predicted 
response rates are estimated with a main-effects model 
from the marginal responses for the variables, the small 
sample sizes in the cells of the crossclassification of all 
the variables are not a concern. However, this benefit is 
gained by relying completely on the validity of the main- 
effects model, that is, by assuming that there are no inter- 
actions between the variables that need to be taken into 
account. 
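
A minimal Python sketch of the predicted logistic adjustment follows. The coefficients are hypothetical values on the logit scale for the probability of response; they are not the estimates in Table 1, and the sign convention (positive values raise the response probability) is an assumption made for the example. Only two predictors are used to keep the sketch short.

```python
import math

# Hypothetical main-effects coefficients (logit scale, P(response)).
# These are NOT the Table 1 estimates.
INTERCEPT = 1.2
EFFECTS = {
    "tenure": {"owner": 0.3, "renter": -0.2, "other": 0.0},
    "imputed": {"0": 0.5, "1+": 0.0},
}

def predicted_response_rate(cell):
    """cell: dict mapping each predictor to the level that defines the cell."""
    eta = INTERCEPT + sum(EFFECTS[var][level] for var, level in cell.items())
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_logistic_adjustment(cell):
    """Inverse of the model-predicted response rate for the cell."""
    return 1.0 / predicted_response_rate(cell)

adj = predicted_logistic_adjustment({"tenure": "renter", "imputed": "1+"})
```

Because the predicted rate comes from the fitted main effects rather than from the cell's own count, the adjustment is defined even for a cell of the crossclassification that contains very few sample persons. It also always exceeds 1, since the predicted response rate lies strictly between 0 and 1.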

One approach to placing less reliance on the main- 
effects model is to base the adjustments on the observed 
response rates in cells that have sample sizes large enough 
to ensure the stability of the observed response rates and 
to base the adjustments on the predicted response rates in 
other cells. The second member of the class of alternative 
adjustments based on logistic regression uses this mixed 
strategy. In cells containing 25 or more sample persons, 
the nonresponse adjustment is the inverse of the observed 
cell response rate. In cells containing fewer than 25 sample
persons, the nonresponse adjustment is the inverse of the 
predicted response rate for the cell. This adjustment is 
called the mixed logistic adjustment. 
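
The mixed rule reduces to a single comparison per cell, as the following Python sketch shows. The function name and the example rates are hypothetical; the threshold of 25 sample persons is the one stated in the text.

```python
def mixed_logistic_adjustment(n_sample, observed_rate, predicted_rate, min_size=25):
    """Inverse of the observed cell response rate when the cell has at least
    min_size sample persons, otherwise inverse of the model-predicted rate."""
    rate = observed_rate if n_sample >= min_size else predicted_rate
    return 1.0 / rate

# Hypothetical cells: a large one (observed rate used) and a small one
# (predicted rate used, so its unstable observed rate of 0.50 is ignored).
large_cell = mixed_logistic_adjustment(120, 0.80, 0.75)
small_cell = mixed_logistic_adjustment(10, 0.50, 0.75)
```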

A third logistic nonresponse adjustment studied is 
similar to the current SIPP procedures. Initial cells were 
defined by the crossclassification of the ten independent 
variables used in the logistic regression. The cells were then 
collapsed until the sample size in each cell exceeded 30, and 
the inverse of the observed response rate within a collapsed 
cell was then used as the nonresponse adjustment. The 
strategy for collapsing cells was to group together cells with 
similar predicted response rates. This nonresponse adjust- 
ment is called the collapsed logistic adjustment. Although 
this adjustment is similar to the current SIPP panel
nonresponse adjustment, the two differ in the
variables used to define the cells and in the methods used to
combine small cells.
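
The text does not give the exact collapsing rule, so the following Python sketch shows only one plausible reading: cells are sorted by predicted response rate and neighbouring cells are merged greedily until each group holds more than 30 sample persons. The greedy rule, the function name and the counts are assumptions, and unweighted counts are used for brevity although the study used weighted response rates.

```python
def collapse_cells(cells, min_size=30):
    """cells: list of (n_sample, n_respondents, predicted_rate).
    Returns (n_sample, n_respondents) for each collapsed cell, merging
    cells that are adjacent in predicted response rate (assumed rule)."""
    ordered = sorted(cells, key=lambda c: c[2])
    groups, n, r = [], 0, 0
    for size, resp, _ in ordered:
        n += size
        r += resp
        if n > min_size:          # group is large enough: close it
            groups.append((n, r))
            n, r = 0, 0
    if n > 0:                     # leftover cells: fold into the last group
        if groups:
            last_n, last_r = groups.pop()
            groups.append((last_n + n, last_r + r))
        else:
            groups.append((n, r))
    return groups

# Hypothetical cells: (sample size, respondents, predicted response rate).
cells = [(12, 6, 0.55), (40, 34, 0.85), (25, 15, 0.60), (8, 7, 0.90)]
collapsed = collapse_cells(cells)
# Adjustment is the inverse of the observed response rate in each collapsed cell.
adjustments = [n / r for n, r in collapsed]
```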


For all three alternative weighting adjustments based 
on the logistic regression model, the observed and 
predicted response rates were computed from weighted 
counts of the number of cases rather than using the un- 
weighted numbers, where the weights were the nonresponse 
adjusted Wave 1 weights. In practice, the weighted and
unweighted adjustments were nearly the same. 


3.1.1 Adjustments Based on CHAID Models 


The second class of methods for adjusting for panel 
nonresponse involved using the CHAID categorical search 
algorithm to divide the data set into adjustment cells. The 
general approach was to define adjustment cells as combi- 
nations of responses to the predictor variables that had the 
greatest discrimination with respect to panel response 
rates, subject to the restriction that each cell should have 
a minimum sample size of at least 25 persons. The panel 
nonresponse adjustment was the inverse of the observed 
response rate in the cell. 

The CHAID algorithm creates cells by splitting the 
data set progressively in a tree structure. The splitting 
along each newly created branch is performed by choosing 
the variable that maximizes a χ² criterion. When the
split involves a polychotomous variable, the split may
involve several branches. The χ² tests are modified using
Bonferroni type adjustments to prevent variables from 
being chosen simply because they have more categories. 
CHAID is one version of the Automatic Interaction 
Detector (AID) developed for categorical variables. Kass 
(1980) presents the theory underlying the CHAID tech- 
nique. Another version of the same methodology was used 
by Lepkowski, Kalton and Kasprzyk (1989) and Kalton, 
Lepkowski and Lin (1985) to model nonresponse in SIPP. 
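
The core of one splitting step can be illustrated by computing the Pearson χ² statistic for each candidate predictor against panel response status. The Python sketch below is deliberately simplified and is not the CHAID algorithm itself: it omits the merging of categories and the Bonferroni-adjusted p-values, and it compares raw statistics, which is only meaningful when the candidates have the same number of categories. The function names and counts are hypothetical.

```python
def pearson_chi2(table):
    """table: one row per category, each row [respondents, nonrespondents].
    Assumes no row or column total is zero."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

def best_split(candidates):
    """candidates: {variable: contingency table}.
    Returns the variable with the largest chi-square statistic."""
    return max(candidates, key=lambda v: pearson_chi2(candidates[v]))

# Hypothetical counts for two dichotomous predictors within one branch.
candidates = {
    "tenure": [[90, 10], [60, 40]],
    "gender": [[76, 24], [74, 26]],
}
chosen = best_split(candidates)
```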

For the current analysis, two CHAID models were 
examined by including different sets of predictor variables. 
The first model included the seven most important predic- 
tors in the logistic regression model (age, relationship to 
reference person, race of householder, tenure, Census 
region, imputation flags, and bond-holding status), plus 
gender. This model resulted in 99 nonresponse adjustment 
cells. The nonresponse adjustment based on this model is 
called CHAID 1. The second CHAID model included the 
13 predictor variables from the logistic regression model 
presented in Table 1. This model resulted in 142 non- 
response adjustment cells. The nonresponse adjustment 
for this model is called CHAID 2. 


3.1.2 Adjustments Based on Generalized Raking 


The third class of methods examined for adjusting for 
panel nonresponse was generalized raking. Unlike the
other approaches, nonresponse adjustment cells were not
developed by crossclassifying the predictor variables.
Rather, raking was directly applied to force the panel
respondents’ marginal distributions for each of the pre-
dictor variables (computed using the adjusted weights) to
equal the corresponding distributions for respondents and
nonrespondents combined (computed using the original
Wave 1 weights). Kalton and Kasprzyk (1986) refer to this
method as sample based raking. The ten predictor variables 
from the reduced logistic regression model were used to 
define the marginal distributions. Hence, the raking 
problem was ten dimensional, with one dimension for each 
predictor variable. 

Raking involves modifying the original weights in order 
to satisfy certain marginal constraints while minimizing 
the distance between the original and adjusted weights. 
Deville and Särndal (1992) describe some distance functions
that may be used and derive the corresponding raking 
methodologies. The raking algorithm of Deming and 
Stephan (1942), which implicitly employs a distance 
function that leads to a multiplicative solution, is one form 
of generalized raking. 
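
The multiplicative (Deming and Stephan) form of raking can be sketched as iterative proportional fitting of unit weights. The Python code below is a bare illustration and is not the CALMAR software: it assumes every level of every raking variable has at least one respondent and that the target totals are mutually consistent. In the application described here, the targets would be the Wave 1 weighted totals for respondents and nonrespondents combined; the two-variable data below are hypothetical.

```python
def rake(weights, categories, targets, max_iter=100, tol=1e-10):
    """Iterative proportional fitting of unit weights.
    weights: initial weights, one per respondent.
    categories: list of dicts; categories[i][var] is the level of unit i.
    targets: {var: {level: target weighted total}}."""
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, levels in targets.items():
            # Current weighted total for each level of this variable.
            current = {level: 0.0 for level in levels}
            for wi, cat in zip(w, categories):
                current[cat[var]] += wi
            # Scale each unit so this margin matches its target.
            for i, cat in enumerate(categories):
                factor = levels[cat[var]] / current[cat[var]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        if max_change < tol:  # all margins already matched in this pass
            break
    return w

# Hypothetical respondents and target margins (totals agree: 6 + 4 = 5 + 5).
units = [
    {"sex": "M", "tenure": "own"},
    {"sex": "M", "tenure": "rent"},
    {"sex": "F", "tenure": "own"},
    {"sex": "F", "tenure": "rent"},
]
targets = {"sex": {"M": 6.0, "F": 4.0}, "tenure": {"own": 5.0, "rent": 5.0}}
raked = rake([1.0, 1.0, 1.0, 1.0], units, targets)
```

Only the margins are controlled, so no cells of the full crossclassification need to be formed, however many raking variables are used.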

The CALMAR software described by Deville, Särndal
and Sautory (1993) was used to compute the adjustments. 
Three different distance functions were examined: the 
multiplicative method, the linear method, and the truncated 
multiplicative method. The adjustments for all three 
distance functions were found to be nearly identical. This 
empirical result is consistent with results given by Deville 
and Särndal (1992) that show that the estimators using
weights generated with different distance functions are 
asymptotically equivalent if the distance functions satisfy 
certain smoothness conditions. The three distance functions 
employed in this research satisfy those conditions. Since 
the adjustments were nearly identical for all three methods, 
only the weighting adjustment from the multiplicative 
method was retained for further evaluation. The resulting 
adjustment is called the raking adjustment.


3.1.3 Distributions of Nonresponse Adjustments


The adjustments for each of the six schemes described 
above were computed for the 1987 SIPP panel file. Table 2 
summarizes the distributions of the resulting nonresponse 
adjustments. The summary is for the adjustments only, 
not the weights that are the products of the adjustments 
and the Wave 1 weights. Table 2 is divided into two
parts: the upper part shows the mean, median, and
extreme values for each adjustment distribution, as well
as (1 + CV²), where CV is the coefficient of variation
for each adjustment. The statistic (1 + CV²) serves as
an indicator of the increase in variance of the estimates 
introduced by having variable nonresponse adjustment 
factors (see Kish 1992). The second part of Table 2 shows 
the correlations among the alternative forms of adjustment. 
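
The statistic (1 + CV²) is easy to compute directly, as in the Python sketch below. The sketch assumes an unweighted CV based on the divide-by-n variance, in which case 1 + CV² equals n Σx² / (Σx)²; the text does not state whether the published values were computed this way. The function name and example values are hypothetical.

```python
def one_plus_cv2(values):
    """1 + CV^2 with the divide-by-n variance: n * sum(x^2) / (sum(x))^2."""
    n = len(values)
    total = sum(values)
    return n * sum(v * v for v in values) / (total * total)

# A constant adjustment adds no variability; unequal adjustments do.
equal = one_plus_cv2([1.26, 1.26, 1.26])
varied = one_plus_cv2([1.0, 1.0, 2.0])
```

A value of 1.125, as in the second example, corresponds to an approximate 12.5 percent increase in the variance of the estimates.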

Since the overall weighted panel response rate is 0.794, 
the mean overall nonresponse adjustment would be 
1/(0.794) = 1.26 if the same adjustment were used for all
persons. The mean weighting adjustments for the three
weighting adjustments that use the inverses of cell response
rates (collapsed logistic, CHAID 1 and CHAID 2) are
necessarily equal to the overall nonresponse adjustment 
of 1.26. The mean weighting adjustments for the other 
schemes differ only minimally from the mean overall 
nonresponse adjustment. 

For all six schemes, the distributions are positively skewed, 
with a few cases with large weights. By their nature, the 
various logistic and CHAID schemes cannot have adjust- 
ments less than 1.00, whereas the raking algorithm can, 
and does, do so. The median weights are similar among all 
schemes, but the maximum weights are not. The CHAID 2 
scheme has a cell with a response rate of only 7 percent, 
leading to the largest maximum weight of 13.93. The raking 
scheme has the smallest maximum weight of 2.51. 


Table 2
Distribution of Panel Nonresponse Adjustments

                      Mean    Minimum    Median    Maximum    1 + CV²

Predicted logistic    1.26     1.04       1.20       4.28      1.02
Mixed logistic        1.26     1.00       1.20       4.28      1.03
Collapsed logistic    1.26     1.00       1.20       3.43      1.02
CHAID 1               1.26     1.02       1.22       3.49      1.03
CHAID 2               1.26     1.01       1.19      13.93      1.04
Raking                1.26     0.91       1.23       2.51      1.02

Correlations

                      Predicted    Mixed       Collapsed
                      Logistic     Logistic    Logistic     CHAID 1    CHAID 2    Raking

Predicted logistic      1.00        0.96        0.73         0.73       0.63       0.95
Mixed logistic                      1.00        0.73         0.72       0.63       0.90
Collapsed logistic                              1.00         0.69       0.58       0.75
CHAID 1                                                      1.00       0.81       0.73
CHAID 2                                                                 1.00       0.63
Raking                                                                             1.00


The values of (1 + CV²) are fairly consistent across
the various adjustments. The CHAID 2 adjustment has
the greatest value of (1 + CV²), primarily because of the
presence of more outlying adjustments (such as the max- 
imum value of 13.93). However, even for this method, the 
approximate increase in the variance of the survey estimates 
is only four percent. The raking adjustment has the smallest 
increase in variance (two percent), but this increase is not 
very different from that of the other methods. 

The pairwise correlations between the six alternative sets 
of weights range from 0.58 to 0.96. Not surprisingly, the 
predicted logistic and mixed logistic weights are highly corre- 
lated. Given the similarity of the predicted main-effects 
logistic regression scheme to raking, it is also not surprising 
that their two sets of weights are highly correlated. The 
relatively high correlations of the raking weights with
the CHAID 1 and collapsed logistic weights are
consistent with the earlier result showing no large interaction
terms. The CHAID 2 weights have the lowest correlations 
with the other sets of weights, except for their correlation 
with the CHAID 1 weights. This finding is probably 
explained by the wide variability in the CHAID 2 weights 
resulting from the use of as many as 142 adjustment cells. 


3.2 Final Panel Weights 


The panel nonresponse adjustment weights discussed 
in the previous section represent the adjustments to the 
Wave 1 weights to compensate for panel nonresponse. The
final panel weights that may be used in the analysis of the 
SIPP panel file are obtained by multiplying the panel 
nonresponse adjustment weights by the Wave 1 weights, 
and then applying poststratification to make weighted 
sample totals conform to totals derived primarily from the 
Current Population Survey (CPS). This procedure was 
applied for each of the six alternative panel nonresponse 
adjustment schemes.

The poststratification procedure used was equivalent
to the current SIPP procedure, except that the latter 
procedure poststratifies by rotation groups whereas for 
the alternative weighting schemes the poststratification 
was performed on all rotation groups combined. The 
difference should not have an appreciable effect. After 
poststratification, the six alternative sets of final weights 
and the SIPP panel weights sum to the same control 
totals. 
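
The construction of a final panel weight can be sketched in Python as the product of the Wave 1 weight and the panel nonresponse adjustment, followed by a poststratification step. The sketch uses a single set of poststrata and invented control totals; the actual SIPP poststratification is more elaborate and relies on control totals derived primarily from the CPS. All names and numbers below are hypothetical.

```python
def poststratify(weights, strata, control_totals):
    """Scale weights so that each poststratum's weighted total
    equals its control total."""
    current = {}
    for w, s in zip(weights, strata):
        current[s] = current.get(s, 0.0) + w
    return [w * control_totals[s] / current[s] for w, s in zip(weights, strata)]

# Hypothetical panel respondents.
wave1_weights = [100.0, 120.0, 80.0, 90.0]
nr_adjustments = [1.25, 1.20, 1.30, 1.10]
strata = ["young", "young", "old", "old"]
controls = {"young": 300.0, "old": 250.0}

# Step 1: apply the panel nonresponse adjustment to the Wave 1 weight.
adjusted = [w * a for w, a in zip(wave1_weights, nr_adjustments)]
# Step 2: poststratify the adjusted weights to the control totals.
final = poststratify(adjusted, strata, controls)
```

Because every scheme is poststratified to the same control totals, the final weights from different nonresponse adjustments sum to the same totals within each poststratum, whatever the adjustments were.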

To compare the final panel weights for the six adjust- 
ment schemes with one another and with the current SIPP 
panel weight, the correlations between the weights were 
computed, along with the measure of variability used 
previously, (1 + CV²). The results are presented in
Table 3. The estimates of the variability due to the weight-
ing (1 + CV²) indicate similar increases of between 8
and 10 percent in the variances of survey estimates for all 
of the weighting schemes. The correlations between the 
alternative sets of final panel weights are all 0.85 or higher. 
Comparing these correlations to those in Table 2, it is 
clear that the correlations between the final weights are 
appreciably higher than those between the panel non- 
response adjustment weights. The correlations between the 
SIPP panel weight and the alternative final weights are 
consistently lower than any others, probably because the 
variables used in forming the nonresponse adjustments for 
this weight differed from those used for the alternative 
weights. The variables used in the alternative schemes that 
are not used in the SIPP panel weight are age, relationship 
to reference person, number of imputed items, class of 
work, and food stamp recipiency. Apart from MSA status
(which was not available due to disclosure concerns),
household size is the only variable used in the SIPP
panel weight but not in the alternative schemes; it was
omitted because it was not found to be significantly
associated with response rates.


Table 3
Correlations Between Poststratified Weights with Variance Inflation Measures

                      SIPP     Predicted    Mixed       Collapsed
                      panel    Logistic     Logistic    Logistic     CHAID 1    CHAID 2    Raking

SIPP panel            1.00      0.75         0.74        0.75         0.71       0.68       0.77
Predicted logistic              1.00         0.99        0.91         0.90       0.86       0.98
Mixed logistic                               1.00        0.91         0.90       0.86       0.97
Collapsed logistic                                       1.00         0.89       0.85       0.93
CHAID 1                                                               1.00       0.94       0.91
CHAID 2                                                                          1.00       0.87
Raking                                                                                      1.00

1 + CV²               1.08      1.09         1.09        1.08         1.09       1.10       1.08


4. COMPARING ESTIMATES USING
ALTERNATIVE WEIGHTS


The previous section described the development of the 
alternative sets of final weights that may be used for the 
analysis of the SIPP panel file. All the final weighting 
schemes incorporate adjustments for unequal selection 
probabilities, nonresponse at the initial wave, panel non- 
response, and poststratification to external control totals. 
This section compares survey estimates obtained using the 
alternative weighting schemes with one another and with 
the corresponding estimates obtained using the SIPP panel 
weights. In addition, where possible, the various survey 
estimates are also compared with external estimates from 
other sources. Some of the external estimates are bench- 
mark estimates obtained from administrative records or 
the Current Population Survey. Other external estimates 
are obtained from Wave 1 of the 1989 SIPP panel. Data
collected in Wave 7 of the 1987 SIPP panel relate to the 
same time period as data collected in Wave 1 of the 1989 
SIPP panel, and hence estimates obtained from these two 
data sources should be comparable. 

In making comparisons with benchmark estimates, it 
needs to be recognized that any differences observed may be
explained by a variety of factors of which panel non- 
response is only one. For example, response errors and 
differences in definitions may explain differences between 
SIPP estimates and benchmark estimates. Thus the bench- 
mark comparisons need to be treated with caution. Since 
the 1989 SIPP panel estimates are based on Wave 1 data,
they are not subject to panel nonresponse. Thus,
differences between estimates obtained from the 1987 and 
1989 SIPP panels are perhaps the most likely to be caused 
by a failure of the panel nonresponse adjustments to fully 
compensate for panel nonresponse bias. However, even 
in this case, alternative explanations such as panel condi- 
tioning could contribute to the differences (although 
Pennell and Lepkowski (1992) show that panel conditioning
is not a major factor in most SIPP estimates). 

Table 4 presents a variety of estimates from the 1987 
SIPP panel file using the SIPP panel weight and the six 
alternative weighting schemes, and corresponding bench- 
mark estimates and estimates from the 1989 SIPP panel 
where available. The estimates are percentages, except for 
the estimates of the mean number of months without 
health insurance, median household income, and annual 
wages. The estimates are for the total population, except 
for the employment estimates (percent employed, un- 
employed and out of the labor force), which are for 
persons over the age of 15, and for annual wages, which 
are for persons over the age of 14. The estimates are for 
three different time periods: June 1987, January 1989, and 
the calendar year of 1987. For example, the first three 
estimates in Table 4 are the estimated percentages of
persons participating in the AFDC (Aid to Families with
Dependent Children) program in June 1987, in January 
1989, and at any time during the 1987 calendar year. A 
comparable estimate from the 1989 SIPP panel is available 
only for the January 1989 time period. 

The most notable finding from Table 4 is the similarity 
of the estimates computed with all the weighting schemes 
from the 1987 panel. The percentage estimates in Table 4 
are in fact given to two decimal places because the use of 
the conventional one decimal place would often show no 
difference between the alternative estimates. The largest 
difference occurs for the percentage employed in January 
1989, where the estimate using the SIPP panel weight is 
62.7 percent and the estimate using the mixed logistic 
regression weight is 62.3 percent. Even this largest of 
differences is relatively small, especially when considering 
that the estimated standard error for this estimate is 
0.3 percent. 

When the 1987 SIPP panel estimates are compared with 
the external estimates from the 1989 SIPP panel and from 
other sources, some of the differences are much larger 
and of substantive importance. To examine these differ- 
ences in more detail, standardized differences between the 
alternative estimates and the benchmark estimates were 
computed and are shown in Table 5. A standardized 
difference is defined as the difference between the alter- 
native estimate and the external estimate divided by the 
standard error of the difference. 
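
A standardized difference of this kind can be computed as in the Python sketch below. The text does not state how the standard error of the difference was obtained; the sketch assumes the two estimates are independent, so that the variances add, and allows the external standard error to be set to zero for a benchmark treated as free of sampling error. The function name and the numerical values are hypothetical.

```python
import math

def standardized_difference(est, se_est, ext, se_ext=0.0):
    """(est - ext) / SE(est - ext), assuming independent estimates
    so that Var(est - ext) = se_est^2 + se_ext^2."""
    return (est - ext) / math.sqrt(se_est ** 2 + se_ext ** 2)

# Hypothetical percentages and standard errors.
z = standardized_difference(10.0, 0.3, 9.2, 0.4)
significant = abs(z) > 2.0
```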

The upper part of Table 5 shows the standardized 
differences when the 1989 SIPP panel is used to produce 
the external estimate. The standardized differences for 
most of the estimates are less than 2.0 in absolute value, 
indicating that the differences may be accounted for by 
sampling error. However, the standardized differences for 
the percentage unemployed and for the poverty rate are 
greater than 2.0 and highly significant. Thus, the alter- 
native weighting adjustments do not succeed in bringing 
the 1987 survey estimates in line with the 1989 survey 
estimates for all characteristics. 

The lower part of Table 5 shows the standardized 
differences when other benchmark estimates are used. 
These standardized differences are generally large and in 
many cases very large. Only a few are less than 2.0 and 
many are greater than 10.0. Given the much smaller 
standardized differences found in the upper part of Table 5 
for similar statistics, it seems likely that factors other than 
panel nonresponse bias are largely responsible for the 
magnitude of these differences. The standardized differ- 
ences based on these largely administrative data sources 
may signal important issues related to the quality of the 
data (from either the SIPP, the benchmark data source, 
or both), but they do not provide much help in assessing 
the effectiveness of alternative nonresponse adjustments 
in reducing panel nonresponse bias. 

Table 4


Estimates for the Total Population from the 1987 SIPP Panel with Alternative Weighting Schemes 
and Estimates from Other Sources 


SIPP Predicted Mixed Collapsed 


Panel Logistic Logistic Logistic Oe a ene 


AFDC - June 1987 3n3 3.70 3.74 35712 
AFDC - January 1989 3.10 Sal2 3.14 3), 12 
AFDC - Annual 1987 4.85 4.78 4.82 4.81 


Food stamps - June 1987 7.43 7.26 7230 7.34 7.38 VAY 721 1350 
Food stamps - January 1989 6.71 6.63 6.67 6.64 6.70 6.59 6.58 6.30 7.294 
Food stamps - Annual 1987 10.30 10.11 10.16 10.18 10.24 10.05 10.06 

Medicaid - January 1989 6.77 6.78 6.81 6.75 6.81 6.68 6.76 6.97 
Medicaid - Annual 1987 9.21 9.21 9.24 9.21 9.25 9.09 9.21 


SSI - June 1987 1.68 1.70 1.69 1.67 1.69 1.65 1.69 1.68° 
SSI - January 1989 1.65 1.67 1.66 1.64 1.66 1.61 1.66 1.65 1.747 
SSI - Annual 1987 1.80 1.82 1.82 1.80 1.82 1.78 1.82 


Social security - January 1989 14.92 14.87 14.87 14.89 14.88 14.89 14.85 


Poverty rate - June 1987 10.88 10.75 10.79 10.76 10.79 10.69 10.74 
Poverty rate - January 1989 12.91 12.98 13.02 12297 1299 12.91 12293 
Entering poverty 1987/1988 De) 33) DSi 2.30 D229 13) 231 
Leaving poverty 1987/1988 2.69 2.63 2.64 2.60 202 2.63 2.63 


Mean months without health 
insurance — 1987 1.66 1.69 1.70 1.67 1.67 1.69 1.69 


Median household income - 
January 1989 2,601 2,600 2,597 2,607 2,607 2,607 2,602 2,550 


Annual wages 1987 


(in trillions) R93) 1.94 1293 1.94 1.94 1.94 1.94 2.224 


Employed - January 1989 62.74 62.36 62.34 62.43 62.42 62.52 62.42 61.60 
Unemployed - January 1989 325i; 3.64 3.63 3.60 3.58 3.60 3.63 4.52 


Out of labor force - 


January 1989 33.69 34.01 34.03 33.96 34.01 33.88 33.95 33.88 


Married in 1987 
Divorced in 1987 0.51 0.50 0.50 0.49 0.50 0.51 0.49 
Changed address in 1987 


0.90° 


1 Social Security Bulletin, Volume 52, No. 3.

2 Social Security Bulletin, Volume 51, No. 7.

3 USDA Food and Nutrition Service, unpublished data.

4 U.S. Bureau of the Census, Current Population Reports, Consumer Income, P-60, No. 174.

5 National Center for Health Statistics: Vital Statistics of the U.S., 1987, Volume III, Marriage and Divorce, DHHS Pub. No. (PHS) 91-1103.

6 U.S. Bureau of the Census, Current Population Reports, Population Characteristics, P-20, No. 473.

Table 5
Standardized Differences Between 1987 SIPP Panel Estimates and Benchmark Estimates

                              Benchmark        SIPP   Predicted       Mixed   Collapsed [illegible] [illegible] [illegible]
                               Estimate       Panel    Logistic    Logistic    Logistic

1989 SIPP panel estimates
AFDC                               3.56       -1.58       -1.52       -1.43       -1.52       -1.44       -1.84       -1.57
Food stamps                        6.30        1.02        0.82        0.92        0.86        1.01        0.73        0.69
Medicaid                           6.97       -0.50       -0.47       -0.40       -0.53       -0.39       -0.70       -0.51
SSI                                1.65        0.05        0.11        0.08       -0.03        0.07       -0.15        0.09
Social Security                   15.14       -0.38       -0.46       -0.46       -0.42       -0.44       -0.42       -0.50
Poverty rate                      14.46       -2.77       -2.64       -2.57       -2.67       -2.63       -2.78       -2.74
Median Income                     2,550        2.05        2.01        1.89        2.30        2.30        2.29        2.09
Employed                          61.60        2.42        1.60        1.56        1.76 [illegible]        1.95 [illegible]
Unemployed                         4.52       -4.93       -4.59       -4.59       -4.76       -4.90       -4.78       -4.60
Out of labor force                33.88       -0.42        0.28        0.32        0.18        0.28       -0.01        0.15

Other benchmark estimates
AFDC - June 1987                   4.28       -2.55       -2.66       -2.49       -2.59       -2.65       -3.14       -2.71
AFDC - January 1989                4.24       -5.71       -5.62       -5.49       -5.63       -5.51       -6.10       -5.70
Food stamps - June 1987            7.35        0.27       -0.31       -0.16       -0.04        0.11       -0.50       -0.48
Food stamps - January 1989         7.29       -2.04       -2.32       -2.17       -2.26       -2.06       -2.44       -2.50
SSI - June 1987                    1.68        0.00        0.13        0.08       -0.03        0.08       -0.20        0.11
SSI - January 1989                 1.74       -0.57       -0.48       -0.53       -0.67       -0.54       -0.84       -0.50
Annual wages 1987                  2.22      -16.12      -15.94      -16.38      -15.66      -15.61      -15.60      -15.78
Married in 1987                    1.86       -5.11       -4.93       -4.98       -5.11       -5.10       -5.07       -4.95
Divorced in 1987                   0.90       -7.15       -7.37       -7.36       -7.40       -7.32       -7.20       -7.40
Changed address in 1987           17.99      -11.49      -10.50      -10.51      -10.80      -10.42      -10.40      -10.49


5. DISCUSSION 


Nonresponse weights are widely used to compensate for 
unit nonresponse in sample surveys. The basic requirement 
for this form of weighting is the availability of information 
on one or more auxiliary variables for both respondents 
and nonrespondents. In many surveys, this information 
is available for only a small number of auxiliary variables 
(such as the PSUs and strata from which the units were 
selected). In such surveys, the nonresponse weights can 
often be simply developed as weighting class adjustments 
for a set of classes based on the crosstabulation of the aux- 
iliary variables. 
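
A weighting class adjustment of this kind can be sketched as follows. The snippet is an illustrative outline rather than the SIPP production procedure: the record layout (`cls`, `weight`, `responded`) and the example data are hypothetical. Within each class, respondent weights are multiplied by the ratio of the weighted total for all sampled units to the weighted total for respondents.

```python
from collections import defaultdict

def weighting_class_adjust(units):
    """Return {index: adjusted weight} for the responding units.

    units: list of dicts with hypothetical keys 'cls' (weighting class),
    'weight' (base weight) and 'responded' (True/False).
    """
    total = defaultdict(float)   # weighted count of all sampled units per class
    resp = defaultdict(float)    # weighted count of respondents per class
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            resp[u["cls"]] += u["weight"]
    adjusted = {}
    for idx, u in enumerate(units):
        if u["responded"]:
            # Inflate by the inverse of the weighted class response rate.
            adjusted[idx] = u["weight"] * total[u["cls"]] / resp[u["cls"]]
    return adjusted

units = [
    {"cls": "A", "weight": 10.0, "responded": True},
    {"cls": "A", "weight": 10.0, "responded": False},
    {"cls": "A", "weight": 20.0, "responded": True},
    {"cls": "B", "weight": 5.0, "responded": True},
    {"cls": "B", "weight": 5.0, "responded": True},
]
adjusted = weighting_class_adjust(units)
print(adjusted)
```

By construction, the adjusted respondent weights in each class sum to the base-weight total of that class, so the overall weighted total is preserved.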

There are, however, surveys in which data are available 
for a large number of auxiliary variables for possible use 
in developing nonresponse weights. This situation often 
applies when an administrative record system is used as 
the survey’s sampling frame, with all the information in 
the system then being available for use in making non- 
response adjustments. It also applies when the survey data 
collection is conducted in two or more phases (e.g., an 
initial screening interview followed by a detailed interview 
or some other form of data collection at a later time point) 
and when nonresponse adjustments are needed for later
phases; in this case, data from prior phases of data collection
may be used in compensating for nonresponse at later
phases. A similar situation applies in panel surveys when 
adjustments are required for nonresponse at later waves 
of the panel, as discussed in this paper. 

When a large number of auxiliary variables is available 
for all sampled units, two main choices need to be made. 
First, there is the choice of auxiliary variables to use in the 
adjustment. Second, there is the choice of the adjustment 
method to be applied. 

The basic approach adopted in this study for choosing 
the auxiliary variables for use in the nonresponse adjustment 
was to identify the set of variables that were good predictors 
of panel nonresponse. With so many auxiliary variables 
available, the first step was a screening procedure to eliminate 
variables that were found to have little association with 
the panel nonresponse rate. Then, logistic regression models 
using predictor variables remaining from the screening were 
examined to identify the set of variables to be retained for 
use in adjusting the weights. Whether the number of aux- 
iliary variables is reduced to a manageable set by this or 
some other approach (e.g., by using the CHAID algorithm), 
this reduction is likely to be a necessary first step when there 
are many potential auxiliary variables available. 

After selecting the subset of auxiliary variables, a wide
variety of methods exists for creating the nonresponse 
adjustments. We examined panel nonresponse adjustments 
based on logistic regression models, categorical search 
models, and sample-based generalized raking. The final 
panel weights resulting from these adjustment schemes 
were highly correlated with one another and they yielded 
estimates that were very similar. None of the schemes 
produced estimates that were superior in terms of bias 
reduction. 

In part, the high correlation of the final panel weights 
generated by the different adjustment schemes may be 
explained by the similarity of many of the adjustment 
schemes. In part, it may be explained by the final post- 
stratification weighting which raised the correlations 
between the weights. It may also be partly explained by 
the lack of large interaction effects between the auxiliary 
variables. If there were sizable interaction effects that were 
not included in the logistic modeling, then one might 
expect greater differences between the raking and predicted 
logistic weights on the one hand and the CHAID, mixed 
logistic, and collapsed logistic weights on the other hand. 
Thus, the similarity in weights produced by the alternative 
weighting schemes for the SIPP may not be as great in 
other circumstances. 

A common concern that arises when many auxiliary 
variables are used to adjust the weights is that the adjusted 
weights might be highly variable, thus causing a serious 
loss of precision in the survey estimates. This proved not 
to be the case in the methods we evaluated. The variability 
of the weights with all the weighting schemes turned out 
to be similar, provided reasonable precautions were taken 
in creating the adjustments. 
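
One common way to summarize the variability of a set of weights is Kish's approximate design effect due to unequal weighting, 1 + CV², where CV is the coefficient of variation of the weights. The sketch below uses that measure as an assumption for illustration; the source does not state which variability measure was used in the study.

```python
def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2 = 1 + CV^2.

    A value of 1.0 means equal weights; larger values indicate a larger
    approximate loss of precision from weight variability.
    """
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return n * total_sq / (total * total)

# Hypothetical weights, for illustration only.
equal = kish_design_effect([2.0, 2.0, 2.0, 2.0])
unequal = kish_design_effect([1.0, 1.0, 1.0, 3.0])
print(equal, unequal)
```

Comparing this quantity across alternative weighting schemes gives a quick check on whether an adjustment that uses many auxiliary variables has inflated the weight variation.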

Although the empirical results do not show any appre- 
ciable differences in the estimates produced using the alter- 
native weighting schemes and those produced using the 
SIPP panel weights, the correlations of the alternative 
adjusted weights and the current SIPP panel weight were 
found to be lower than the correlations among the alter- 
native weights. This finding suggests that the choice of 
auxiliary variables is an important one, and probably more 
important than the choice of the weighting methodology. 
Although the more systematic methods used in this research 
for choosing the auxiliary variables did not result in major 
improvements over the current SIPP procedures, an 
analytically based choice of auxiliary variables may be more
productive in other studies. 

When a sizable number of auxiliary variables that are 
correlated to response propensity is available, it seems wise 
to use as many of them as possible in the nonresponse 
adjustment to serve as a safeguard in attempting to com- 
pensate for nonresponse bias. This general strategy should, 
however, be tempered by a careful assessment of the 
variation of the resulting weights in order to avoid too 
great a loss of precision in the survey estimates. In addition,
a practical consideration that should be taken into account
is the ease of implementation of the weighting method- 
ology. If, as in this study, alternative weighting method- 
ologies yield very similar weights and estimates, a method 
that is simple to apply may be preferable. 


Multiple Sample Estimation of Population and Census Undercount
in the Presence of Matching Errors


YE DING and STEPHEN E. FIENBERG¹


ABSTRACT 


The multiple capture-recapture census is reconsidered by relaxing the traditional perfect matching assumption. We 
propose matching error models to characterize error-prone matching mechanisms. The observed data take the form 
of an incomplete $2^k$ contingency table with one missing cell and follow a multinomial distribution. We develop a
procedure for the estimation of the population size. Our approach applies to both standard log-linear models for 
contingency tables and log-linear models for heterogeneity of catchability. We illustrate the method and estimation 
using a 1988 dress rehearsal study for the 1990 census conducted by the U.S. Bureau of the Census. 


KEY WORDS: Capture-recapture census; Estimates for total population size; Log-linear models; Matching errors;
Multiple recapture census.


1. INTRODUCTION 


The multiple recapture census technique has been used in 
many fields to estimate the size of a closed population. 
Cormack (1968) and Seber (1982) give excellent reviews of 
many techniques used. Here we consider a sequence of 
samples, $s_1, \ldots, s_k$, where the members of the i-th sample are
uniquely labeled, for example, by tagging or marking, and then 
returned to the population (Darroch 1958). Usual multiple 
recapture census methods make the following assumptions. 
(1) Perfect matching. Individuals in one list (information
source, sample) can be matched with those in another
list without error. In other words, there are no mis- 
classification errors with respect to determining whether 
a particular individual has been recorded by both 
information sources or only one of them. 

(2) Independence. The lists are independent of one another, 
that is, the probability of an individual being included 
in one list does not depend on whether the individual 
was included in previous lists. 

(3) Homogeneity (Equal Catchability). All individuals in 
the population under study have equal probabilities of 
being observed (captured) in any list (sample). 

(4) Closure. The population in question is ‘‘closed’’, so 
that there are no changes due to birth, death, emi- 
gration, or immigration during the period when the 
sampling takes place. 

Darroch (1958) examined the multiple recapture census 
under these four assumptions. Fienberg (1972) adopted a 
log-linear model approach to allow for statistical dependence 
of specific types among samples, thereby dropping the 
independence assumption. Darroch, Fienberg, Glonek and 
Junker (1993) developed an extended log-linear model
approach that allows for individual-level heterogeneity as
well as dependence, but it requires at least three samples,
i.e., k ≥ 3. In the context of the two-sample census approach
used by the U.S. Bureau of the Census for census coverage evaluation,
matching problems due to unavoidable mismatches
and erroneous nonmatches have been explored by several
authors. For example, Ding and Fienberg (1994) considered
modeling matching errors in the two-sample census and
developed a systematic procedure for the estimation of population
totals. The inclusion of a third sample, e.g., drawn from
administrative records, in modeling and estimation of
census coverage has been considered by the U.S. Bureau of the
Census in the past and remains an option to augment and
evaluate the dual system approach. In this paper, we consider 
matching error models for the multiple sample census 
problem, allowing for both dependence and heterogeneity. 

Here we view the observations from multiple recapture
census data as falling into a $2^k$ cross-classification, with
absence or presence in the i-th sample defining the category
for the i-th dimension. In this cross-classification, the cell
corresponding to absence for all k samples is missing. The
objective is to estimate the number of individuals in the
population who are not observed, which corresponds to
the missing cell in the $2^k$ incomplete contingency table. In
Section 2, we investigate the effects of matching errors on
the observed $2^k$ incomplete table. In Section 3, some
models for matching errors are proposed to characterize an
error-prone matching process. Based on these models and
assumptions (3) and (4), we develop a procedure using a
log-linear model formulation for the estimation of the population
size. In Section 5, we use the proposed methods to analyze
data from the 1988 Dress Rehearsal Census conducted by the
U.S. Bureau of the Census.


¹ Ye Ding, Research Scientist, Bureau of Biometrics, New York State Health Department, Concourse, Room C-144, Empire State Plaza, Albany,
New York 12237, U.S.A.; Stephen E. Fienberg, Maurice Falk Professor of Statistics and Social Science, Department of Statistics, Carnegie Mellon 


2. MATCHING ERRORS IN MULTIPLE
SAMPLE CENSUS


We begin by classifying matching errors into two broad
categories, mismatches and erroneous nonmatches. To
understand the nature of matching errors in a multiple-sample
census, we review the case of a three-sample
census. Suppose that there are no missing data or errors
in recording the information for any individual in the
population and one takes three samples from the population,
$s_1$, $s_2$, and $s_3$. For instance, suppose that, in sample
$s_1$, individuals 1, 3, 4 and 7 are seen, individuals 3, 4, and 8
are seen in $s_2$, and individuals 4, 9, and 10 in $s_3$. In vector
notation, we can represent this as $s_1 = (1, 3, 4, 7)$,
$s_2 = (3, 4, 8)$ and $s_3 = (4, 9, 10)$. Matching errors are
not present provided that there is complete and correct
information available. We thus have the following incomplete
$2^3$ table corresponding to these three samples:


Table 1
Original Table without Matching Errors

                      s1 Present                s1 Absent
                 s2 Present  s2 Absent    s2 Present  s2 Absent
s3 Present            1          0             0          2
s3 Absent             1          2             1          -

Suppose further that, because of missing data or
incorrect information, we actually observe

$$s_1 = (1, 3, 4, 7), \quad s_2 = (3^*, 4^*, 8), \quad s_3 = (4, 9, 10),$$

where 3* and 4* are individuals 3 and 4 but with incorrect
information leading to two erroneous nonmatches when
the samples are matched. Assuming no erroneous matches,
we then observe the incomplete $2^3$ table:


Table 2
Observed Table with Matching Errors

                      s1 Present                s1 Absent
                 s2 Present  s2 Absent    s2 Present  s2 Absent
s3 Present            0          1             0          2
s3 Absent             0          3             3          -
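
The two tables can be reproduced mechanically from the sample lists. The following sketch tabulates capture histories (presence or absence in each sample) for every distinct label; the function name `capture_table` and the use of the strings "3*" and "4*" to stand for the mislabeled records are illustrative choices, not part of the original paper.

```python
from collections import Counter

def capture_table(samples):
    """Count distinct labels by capture history.

    samples: list of lists of labels. A history is a tuple of 0/1 flags,
    one per sample, e.g. (1, 1, 0) = seen in s1 and s2 but not in s3.
    The all-zero history is never observed (the missing cell).
    """
    labels = set().union(*map(set, samples))
    return Counter(tuple(int(x in s) for s in samples) for x in labels)

# Samples without matching errors (Table 1).
true_table = capture_table([[1, 3, 4, 7], [3, 4, 8], [4, 9, 10]])

# Samples as observed, with individuals 3 and 4 mislabeled in s2 (Table 2).
observed_table = capture_table([[1, 3, 4, 7], ["3*", "4*", 8], [4, 9, 10]])

print(sorted(true_table.items()))
print(sorted(observed_table.items()))
```

The total number of distinct individuals rises from 7 to 9 because each erroneous nonmatch splits one person into two apparent persons, while the size of each sample is unchanged.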


The effects of matching errors are obvious from a
comparison of Tables 1 and 2:

(i) The number of observations may increase for some cells
while decreasing for others and, as a consequence,
the marginal totals and especially the total number
of different individuals observed in the three samples
may change, subject to the constraint that the total
number of observations in each sample, $x_{1++}$, $x_{+1+}$,
and $x_{++1}$, remains the same. Changes in the total
number of different individuals in all samples make
our problem distinct from the usual misclassification
problem in the analysis of categorical data, in which
the possibility of making mistakes in classifying individuals
into respective categories is considered (e.g.,
see Chen 1979).

(ii) In parallel, there may be changes in some cell probabilities,
subject to the constraint that the probability of
being captured in a sample, $p_{1++}$, $p_{+1+}$, and $p_{++1}$,
is unchanged.


Because of the complexity of matching errors in the
three-sample case, we need some special terminology
for descriptive convenience. We say that an individual is
at state 1 with respect to a sample if the individual is
observed in that sample, and at state 0 if not. We use a triple (i,j,k),
0 ≤ i,j,k ≤ 1, to denote an individual at states i, j, and
k with respect to $s_1$, $s_2$ and $s_3$, respectively. For instance,
(1,0,0) is an individual observed only in $s_1$, and (1,1,1) is
an individual captured in all three samples. We define the
level of an individual (i,j,k) as i + j + k, i.e., the
number of samples in which the individual is included.
There are four different levels, 0, 1, 2 and 3. An individual
has level 0 if and only if he/she is not captured by any
sample, and has level 3 if he/she is in all three samples. For
a (1,1,0) individual, if the correct match is not made
according to the matching rule, this individual decomposes
into "two different" individuals, a (1,0,0) and a (0,1,0),
assuming no erroneous matches. On the other hand, a
(1,0,0) individual matched incorrectly with a (0,1,0) will
produce a single observed (1,1,0) individual. For convenience,
we call such a decomposition or combination a
transition. In the absence of erroneous matches, transitions
can only go from level 3 or 2 to the same level (if there is
no matching error) or to lower levels. More specifically, a
(1,1,1) person may make a transition into one of 5 possible
sets of individuals:

{(1,1,1)},  {(1,0,0), (0,1,1)},  {(0,1,0), (1,0,1)},

{(0,0,1), (1,1,0)},  {(1,0,0), (0,1,0), (0,0,1)}.

For level 2 individuals, (1,1,0) can decompose into
{(1,0,0),(0,1,0)} or stay at {(1,1,0)}, and similarly for
(0,1,1) and (1,0,1). From the above discussion, we
summarize the effect of matching errors by the following
diagram:

Table 1 → {Matching Process} → Table 2


where Table 1 is the original $2^k$ incomplete table with no
matching errors and Table 2 is the observed $2^k$ incomplete
table in the presence of matching errors. Henceforth, we
denote the cell probabilities and expected cell counts
associated with Table 1 by $\{r_{ijk}\}$ and $\{l_{ijk}\}$ and those of
Table 2 by $\{p_{ijk}\}$ and $\{m_{ijk}\}$, for 1 ≤ i,j,k ≤ 2.
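
The possible transitions of an individual, in the absence of erroneous matches, correspond to the ways of splitting the set of samples in which the individual appears into non-empty groups. The sketch below enumerates them; the function names are illustrative and the code is offered only as a way to check the counts quoted in this section (5 sets for a (1,1,1) individual, 2 for a level 2 individual).

```python
def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or let it form a block of its own.
        yield [[first]] + part

def transitions(history):
    """List the sets of observed individuals a true history can turn into."""
    present = [i for i, h in enumerate(history) if h == 1]
    k = len(history)
    return [
        sorted(tuple(int(i in block) for i in range(k)) for block in part)
        for part in set_partitions(present)
    ]

for outcome in transitions((1, 1, 1)):
    print(outcome)
```

Each printed list is one possible outcome; the outcome containing only the original history corresponds to a correct match.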


3. SOME MODELS FOR MATCHING 
ERRORS 


We now propose models to describe the matching 
errors, each of which allows us to formulate the realloca- 
tion of cell probabilities and expected cell counts associated 
with Table 1. 

Model (1). In addition to the homogeneity and closure
assumptions in §1, we assume that: (i) there are no
erroneous matches in the matching process; (ii) any individual
will stay at his original state with probability θ, and
transition to any of a possible set of individuals with
probability (1 − θ)/(m − 1), where m is the number of
all possible sets of individuals to which the individual may
transition. For example, for a (1,1,1) person, as discussed at the
end of the last section, m = 5.

Under this model, for the three-sample census, we can
express the probabilities for the table with matching errors,
$\{p_{ijk}\}$, in terms of probabilities of the table with no
matching errors, $\{r_{ijk}\}$:

$$
\begin{aligned}
p_{111} &= \theta r_{111},\\
p_{112} &= \frac{1-\theta}{4} r_{111} + \theta r_{112},\\
p_{121} &= \frac{1-\theta}{4} r_{111} + \theta r_{121},\\
p_{211} &= \frac{1-\theta}{4} r_{111} + \theta r_{211},\\
p_{122} &= \frac{1-\theta}{2} r_{111} + (1-\theta) r_{112} + (1-\theta) r_{121} + r_{122},\\
p_{212} &= \frac{1-\theta}{2} r_{111} + (1-\theta) r_{112} + (1-\theta) r_{211} + r_{212},\\
p_{221} &= \frac{1-\theta}{2} r_{111} + (1-\theta) r_{211} + (1-\theta) r_{121} + r_{221}.
\end{aligned}
$$

Let

$$\vec{p} = (p_{111}, p_{112}, p_{121}, p_{211}, p_{122}, p_{212}, p_{221})^T$$

and

$$\vec{r} = (r_{111}, r_{112}, r_{121}, r_{211}, r_{122}, r_{212}, r_{221})^T,$$

then

$$\vec{p} = M_1 \times \vec{r}. \qquad (1)$$

Here $M_1$ is a 7 by 7 matrix determined by the above
seven equations derived under Model (1). It is straightforward
to verify that the probability of catching any individual
in each sample is fixed, i.e., $p_{1++} = r_{1++} = p_1$,
$p_{+1+} = r_{+1+} = p_2$, $p_{++1} = r_{++1} = p_3$. This must be
the case because the sample capture probabilities do not
depend on how the matching mechanism operates.
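
The invariance of the capture probabilities under Model (1) can be checked numerically. The sketch below writes out the 7 by 7 matrix implied by the model statement (a (1,1,1) individual moves to each of its 4 other sets with probability (1 − θ)/4, and a level 2 individual decomposes with probability 1 − θ). The cell order follows the vector of cell probabilities defined above; the values of θ and of the capture probabilities are hypothetical, and the independence of the three samples is assumed only to generate example inputs.

```python
def model1_matrix(theta):
    """Rows and columns ordered as 111, 112, 121, 211, 122, 212, 221."""
    a = (1 - theta) / 4   # (1,1,1) -> one particular non-identity set
    b = 1 - theta         # level 2 individual decomposes
    return [
        [theta, 0, 0, 0, 0, 0, 0],       # p111
        [a, theta, 0, 0, 0, 0, 0],       # p112
        [a, 0, theta, 0, 0, 0, 0],       # p121
        [a, 0, 0, theta, 0, 0, 0],       # p211
        [2 * a, b, b, 0, 1, 0, 0],       # p122: (1,0,0) appears in two sets
        [2 * a, b, 0, b, 0, 1, 0],       # p212
        [2 * a, 0, b, b, 0, 0, 1],       # p221
    ]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def margins(v):
    """(v_{1++}, v_{+1+}, v_{++1}) for a vector in the cell order above."""
    return (v[0] + v[1] + v[2] + v[4],
            v[0] + v[1] + v[3] + v[5],
            v[0] + v[2] + v[3] + v[6])

# Hypothetical example: independent samples with capture probabilities
# 0.6, 0.5 and 0.4, and theta = 0.9.
r = [0.12, 0.18, 0.12, 0.08, 0.18, 0.12, 0.08]
p = matvec(model1_matrix(0.9), r)
print(margins(r), margins(p))
print(sum(r), sum(p))
```

The margins agree, while the sum of the entries of p exceeds that of r: matching errors leave each sample's capture probability intact but inflate the expected number of apparently distinct individuals.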

We can easily generalize this formulation to handle the 
k-sample case; however, the algebra involved is quite 
messy for large k. We can simplify this model by requiring 
that the transitions can go downwards by at most one level, 
thus yielding Model (2): 

Model (2). In addition to the homogeneity and closure
assumptions in §1, we assume that: (i) there are no
erroneous matches in the matching process; (ii) transitions
can only go downwards by at most one level; (iii) any
individual will stay at his original state with probability
θ, and transition to any of a possible set of individuals with
probability (1 − θ)/(m′ − 1), where m′ is the number
of sets of individuals to which transitions are possible and
allowed.

We first consider the three-sample case. A (1,1,1) individual
can decompose into three individuals, i.e., (1,1,1) →
{(1,0,0), (0,1,0), (0,0,1)} (we use "→" to denote
decomposition), if three presumed matches are not made.
Assumption (ii) of Model (2) assumes that this triple error
has negligible probability when compared with the transitions
in which only one of the matches is not made, so that
(1,1,1) → {(1,1,0),(0,0,1)}, or (1,1,1) → {(1,0,1),(0,1,0)},
or (1,1,1) → {(0,1,1),(1,0,0)}.

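For the three-sample case, the parametric model for
expressing $\{p_{ijk}\}$ in terms of $\{r_{ijk}\}$ is:

$$
\begin{aligned}
p_{111} &= \theta r_{111},\\
p_{112} &= \frac{1-\theta}{3} r_{111} + \theta r_{112},\\
p_{121} &= \frac{1-\theta}{3} r_{111} + \theta r_{121},\\
p_{211} &= \frac{1-\theta}{3} r_{111} + \theta r_{211},\\
p_{122} &= \frac{1-\theta}{3} r_{111} + (1-\theta) r_{112} + (1-\theta) r_{121} + r_{122},\\
p_{212} &= \frac{1-\theta}{3} r_{111} + (1-\theta) r_{112} + (1-\theta) r_{211} + r_{212},\\
p_{221} &= \frac{1-\theta}{3} r_{111} + (1-\theta) r_{211} + (1-\theta) r_{121} + r_{221}.
\end{aligned}
$$

Then

$$\vec{p} = M_2 \times \vec{r}, \qquad (2)$$

where $M_2$ is a 7 by 7 matrix determined by the above seven
equations derived under Model (2). Again, the capture
probabilities are unchanged, i.e., $p_{1++} = r_{1++} = p_1$,
$p_{+1+} = r_{+1+} = p_2$, $p_{++1} = r_{++1} = p_3$.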

For the k-sample problem, let $p^*_1$ be the probability of
being captured in all samples, i.e., $p^*_1 = p_{11\cdots1}$, and let
$p_{1,2(h_1,h_2)}$ be the cell probability corresponding to absence
in the $h_1$-th and $h_2$-th samples and presence in the others,
etc. Under Model (2), we have $p^*_1 = \theta r^*_1$. For i ≤ k − 2,
the probability of being missed by the $h_1$-th, $h_2$-th, ...,
and $h_i$-th samples and captured by the others is

$$
p_{1,2(h_1,\ldots,h_i)} = \theta\, r_{1,2(h_1,\ldots,h_i)}
+ \frac{1-\theta}{k-i+1} \sum_{j=1}^{i} r_{1,2(\{h_1,h_2,\ldots,h_i\}\setminus h_j)}.
$$

For i = k − 1, the individual is included in only one
sample. For example, the probability of being captured
only by the first sample is

$$
p_{1(1),2} = r_{1(1),2} + (1-\theta)\sum_{h \neq 1} r_{1(1,h),2}
+ \frac{1-\theta}{3} \sum_{2 \le h_1 < h_2 \le k} r_{1(1,h_1,h_2),2}
+ \cdots + \frac{1-\theta}{k}\, r^*_1,
$$

where $r_{1(1,h_1,h_2,\ldots,h_i),2}$ is the cell probability in the original
table which corresponds to presence in the first, $h_1$-th,
$h_2$-th, ..., $h_i$-th samples and absence in the others. By
symmetry, we can write down the expression for $p_{1(h),2}$,
the probability of being observed in the h-th sample only
and missed in all others.
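
For general k, the reallocation under Model (2) is easier to generate programmatically than to write out. The sketch below is one reading of the model, stated as an assumption: a level 1 individual never changes, a level 2 individual decomposes into two level 1 individuals with probability 1 − θ, and an individual at level ℓ ≥ 3 loses exactly one of its ℓ samples with probability (1 − θ)/ℓ each. The independence model used to build the example inputs, and all numerical values, are hypothetical.

```python
from itertools import product

def model2_transitions(history, theta):
    """Yield (probability, list of observed histories) for one true history."""
    k = len(history)
    present = [i for i, h in enumerate(history) if h]
    level = len(present)

    def single(i):
        # History of an individual seen in sample i only.
        return tuple(int(j == i) for j in range(k))

    if level <= 1:
        yield 1.0, [history]
        return
    yield theta, [history]
    if level == 2:
        yield 1 - theta, [single(present[0]), single(present[1])]
        return
    for i in present:
        rest = tuple(0 if j == i else h for j, h in enumerate(history))
        yield (1 - theta) / level, [rest, single(i)]

def observed_probs(r, theta):
    """Map true cell probabilities r to observed cell probabilities p."""
    p = {}
    for hist, prob in r.items():
        for q, pieces in model2_transitions(hist, theta):
            for piece in pieces:
                p[piece] = p.get(piece, 0.0) + prob * q
    return p

def independent_r(capture):
    """Cell probabilities under independent samples (example input only)."""
    r = {}
    for hist in product((1, 0), repeat=len(capture)):
        if any(hist):
            pr = 1.0
            for h, c in zip(hist, capture):
                pr *= c if h else 1 - c
            r[hist] = pr
    return r

def margin(p, i):
    return sum(v for h, v in p.items() if h[i])

capture = [0.6, 0.5, 0.4, 0.3]          # hypothetical, k = 4
p4 = observed_probs(independent_r(capture), theta=0.9)
print([round(margin(p4, i), 6) for i in range(4)])
```

For k = 3 this construction reproduces the seven equations of Model (2), and for any k the capture probability of each sample is unchanged.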

We can refine Model (2) by assuming unequal matching
rates. For example, we consider two decompositions:
(1,1,1) → {(1,1,0),(0,0,1)} and (1,1,0) → {(0,1,0),(1,0,0)}.
Common to both cases is that one presumed match is
not made. They differ in that one has two sources of information
for that match while the other has only one. It is
reasonable to assume different matching error probabilities
for the two cases instead of a common one as proposed
in Model (2). This leads to:


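Model (3). In addition to (i) and (ii) in Model (2), we
assume

$$
(1,1,1) \rightarrow
\begin{cases}
(1,1,1) & \text{with probability } \alpha_1,\\
\{(1,1,0),(0,0,1)\} & \text{with probability } (1-\alpha_1)/3,\\
\{(0,1,1),(1,0,0)\} & \text{with probability } (1-\alpha_1)/3,\\
\{(1,0,1),(0,1,0)\} & \text{with probability } (1-\alpha_1)/3,
\end{cases}
$$

$$
(1,1,0) \rightarrow
\begin{cases}
(1,1,0) & \text{with probability } \alpha_2,\\
\{(0,1,0),(1,0,0)\} & \text{with probability } 1-\alpha_2,
\end{cases}
$$

$$
(1,0,1) \rightarrow
\begin{cases}
(1,0,1) & \text{with probability } \alpha_2,\\
\{(1,0,0),(0,0,1)\} & \text{with probability } 1-\alpha_2,
\end{cases}
$$

$$
(0,1,1) \rightarrow
\begin{cases}
(0,1,1) & \text{with probability } \alpha_2,\\
\{(0,1,0),(0,0,1)\} & \text{with probability } 1-\alpha_2,
\end{cases}
$$

and (1,0,0), (0,1,0), (0,0,1) stay the same with probability
one.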


Under this model, we can express the cell probabilities
$\{p_{ijk}\}$ in Table 2 in terms of $\alpha_1$, $\alpha_2$ and the cell
probabilities of Table 1, $\{r_{ijk}\}$. To do this, we need to consider
all possible transitions that produce an individual that falls
into the (i,j,k) cell in Table 2. For example, we consider
an observed (1,0,0) individual. This person falls into cell
(1,2,2) of Table 2. Let F be the event that an observed
individual has a (1,0,0) status. Let $E_{ijk}$ be the event that
an individual falls into the (i,j,k) cell in Table 1. Then

$$F = \bigcup_{\{i,j,k\}} (E_{ijk} \cap F).$$

According to Model (3), there are only four possible
transitions that can make F happen:

$$
\begin{aligned}
(1,1,1) &\rightarrow \{(1,0,0),(0,1,1)\},\\
(1,1,0) &\rightarrow \{(1,0,0),(0,1,0)\},\\
(1,0,1) &\rightarrow \{(1,0,0),(0,0,1)\},\\
(1,0,0) &\rightarrow \{(1,0,0)\}.
\end{aligned}
$$
Therefore

$$F = (E_{111} \cap F) \cup (E_{112} \cap F) \cup (E_{121} \cap F) \cup (E_{122} \cap F).$$

By the definitions of cell probabilities of the two tables,
$P(F) = p_{122}$ and $P(E_{ijk}) = r_{ijk}$. By the assumptions
in Model (3), $P(F \mid E_{111}) = (1-\alpha_1)/3$, $P(F \mid E_{112}) =
P(F \mid E_{121}) = 1 - \alpha_2$, and $P(F \mid E_{122}) = 1$.

Since $E_{111} \cap F$, $E_{112} \cap F$, $E_{121} \cap F$ and $E_{122} \cap F$ are four
mutually exclusive ways in which F can happen,

$$
\begin{aligned}
p_{122} &= P(E_{111} \cap F) + P(E_{112} \cap F) + P(E_{121} \cap F) + P(E_{122} \cap F)\\
&= P(F \mid E_{111}) P(E_{111}) + P(F \mid E_{112}) P(E_{112})
 + P(F \mid E_{121}) P(E_{121}) + P(F \mid E_{122}) P(E_{122})\\
&= \frac{1-\alpha_1}{3} r_{111} + (1-\alpha_2) r_{112} + (1-\alpha_2) r_{121} + r_{122}.
\end{aligned}
$$


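In the same manner, we can derive the expressions for the
other cell probabilities of Table 2 to get

$$
\begin{aligned}
p_{111} &= \alpha_1 r_{111},\\
p_{112} &= \frac{1-\alpha_1}{3} r_{111} + \alpha_2 r_{112},\\
p_{121} &= \frac{1-\alpha_1}{3} r_{111} + \alpha_2 r_{121},\\
p_{211} &= \frac{1-\alpha_1}{3} r_{111} + \alpha_2 r_{211},\\
p_{122} &= \frac{1-\alpha_1}{3} r_{111} + (1-\alpha_2) r_{112} + (1-\alpha_2) r_{121} + r_{122},\\
p_{212} &= \frac{1-\alpha_1}{3} r_{111} + (1-\alpha_2) r_{112} + (1-\alpha_2) r_{211} + r_{212},\\
p_{221} &= \frac{1-\alpha_1}{3} r_{111} + (1-\alpha_2) r_{211} + (1-\alpha_2) r_{121} + r_{221}.
\end{aligned}
$$

Then

$$\vec{p} = M_3 \times \vec{r}, \qquad (3)$$

where $M_3$ is a 7 by 7 matrix determined by the above
seven equations derived under Model (3).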

For $\alpha_1 = \alpha_2 = \theta$, we get the same formulation as
under Model (2). For the special case with $\alpha_1 = \alpha_2 = 1$,
$p_{ijk} = r_{ijk}$, reducing to the traditional problem. Again,
the capture probabilities remain the same, i.e., $p_{1++} = r_{1++}$,
$p_{+1+} = r_{+1+}$, $p_{++1} = r_{++1}$.
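
The two special cases just mentioned can be checked directly from the matrix form of Model (3). The sketch below writes out $M_3$ in the cell order 111, 112, 121, 211, 122, 212, 221 and also computes its column sums, which give the expected number of observed individuals generated by one true individual in each cell. The function names and the numerical values of $\alpha_1$ and $\alpha_2$ are illustrative assumptions.

```python
def model3_matrix(a1, a2):
    """M3 with rows/columns ordered 111, 112, 121, 211, 122, 212, 221."""
    c = (1 - a1) / 3   # a (1,1,1) individual loses one particular sample
    d = 1 - a2         # a level 2 individual decomposes
    return [
        [a1, 0, 0, 0, 0, 0, 0],
        [c, a2, 0, 0, 0, 0, 0],
        [c, 0, a2, 0, 0, 0, 0],
        [c, 0, 0, a2, 0, 0, 0],
        [c, d, d, 0, 1, 0, 0],
        [c, d, 0, d, 0, 1, 0],
        [c, 0, d, d, 0, 0, 1],
    ]

def column_sums(M):
    return [sum(row[j] for row in M) for j in range(len(M[0]))]

identity = model3_matrix(1.0, 1.0)      # no matching errors
M3 = model3_matrix(0.9, 0.95)           # hypothetical matching rates
print(column_sums(M3))
```

With $\alpha_1 = \alpha_2 = 1$ the matrix is the identity. Otherwise the column sums for level 3 and level 2 cells are $2 - \alpha_1$ and $2 - \alpha_2$, which quantifies how erroneous nonmatches inflate the count of apparently distinct individuals.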


4. ESTIMATING THE SIZE OF THE 
POPULATION 


4.1 Log-linear Model Formulation 


For purposes of exposition, we confine our attention
to the three-sample census case, although extensions to the
k-sample census for k > 3 are straightforward. As before,
let $l_{ijk}$ and $m_{ijk}$ be the expected cell counts for Table 1 and
Table 2, respectively. The relationship between the cell
probabilities and the expected cell counts is $l_{ijk} = r_{ijk} N$
and $m_{ijk} = p_{ijk} N$. Let

$$\vec{m} = (m_{111}, m_{112}, m_{121}, m_{211}, m_{122}, m_{212}, m_{221})^T$$

and

$$\vec{l} = (l_{111}, l_{112}, l_{121}, l_{211}, l_{122}, l_{212}, l_{221})^T.$$

Since, for each of the models we have proposed in the
last section, there is a matrix M with entries depending on
the matching probability parameters in the chosen model
such that $\vec{p} = M \times \vec{r}$, multiplying through by N gives

$$\vec{m} = M \times \vec{l}. \qquad (4)$$

For any log-linear model specified for Table 1, it is
straightforward to obtain the parameterization for $m_{ijk}$.
For example, for any of the models suggested in Fienberg
(1972), we can write the expected counts as
functions of the u-term parameters:

$$l_{ijk} = l_{ijk}\big(u, u_1(i), u_2(j), u_3(k), u_{12}(ij), u_{13}(ik), u_{23}(jk)\big), \qquad (5)$$

and then obtain the parameterization of $\{m_{ijk}, (ijk) \neq (222)\}$
from (4).


4.2 Estimating the Size of the Population 


We now consider the matching rates in our various
models as known. To obtain the estimate of the population
size, we proceed as follows. First, following Sanathanan
(1972), we compute the maximum likelihood estimates of the
u-term parameters from $l_c$, the conditional likelihood
associated with Table 2 given n,

$$
l_c = \frac{n!}{\prod_{\{(ijk) \neq (222)\}} x_{ijk}!}
\prod_{\{(ijk) \neq (222)\}} q_{ijk}^{\,x_{ijk}},
$$

where $n = \sum_{\{(ijk) \neq (222)\}} x_{ijk}$ and $q_{ijk} = m_{ijk}/n$. Sanathanan
(1972) shows that, under suitable regularity conditions, the
conditional maximum likelihood estimates and the unconditional
ones are both consistent and have the same asymptotic
normal distribution. If we remove redundant u-term
parameters using the constraints associated with the
specified log-linear model for Table 1, then the problem
is to find the maximum of $l_c$ subject to the following
single constraint:

$$\sum_{\{(ijk) \neq (222)\}} m_{ijk} = n.$$

Numerically, this is a nonlinearly constrained optimization
problem. Rao (1957) studied regularity conditions under
which there exist unique maximum likelihood estimates of
the parameters in a multinomial distribution. His conditions
are satisfied by the parameterization of $\{q_{ijk}\}$. Once
the conditional maximum likelihood estimates of the
u-term parameters are obtained, we use the log-linear
model specified for Table 1 to compute the conditional
maximum likelihood estimates of $\{l_{ijk}\}$, the expected cell
counts of Table 1 including the expected count of the
missing cell. Then our estimate of N is

$$\hat{N} = \sum_{\{i,j,k\}} \hat{l}_{ijk}.$$

In the case of no matching errors, with $\alpha_1 = \alpha_2 = 1$ in
Model (3), $m_{ijk} = l_{ijk}$. Thus

$$\hat{N} = n + \hat{m}_{222},$$

i.e., we get back to the estimation method for the traditional
multiple recapture census problem developed by
Fienberg (1972) when the log-linear models in Fienberg
(1972) are considered.
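
The estimation steps can be illustrated with a small self-contained sketch. It rests on several simplifying assumptions that are not taken from the paper: the log-linear model for Table 1 is the independence model, parameterized by three logit capture probabilities; the matching-error model is Model (3) with known $\alpha_1$ and $\alpha_2$; and a basic compass (coordinate) search stands in for a proper constrained optimizer. The constraint that the fitted $m_{ijk}$ sum to n is imposed by setting $\hat{N} = n / \sum p_{ijk}$. The data are exact expected counts generated from hypothetical parameter values.

```python
import math

# Capture histories in the cell order 111, 112, 121, 211, 122, 212, 221.
CELLS = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

def model3_matrix(a1, a2):
    c, d = (1 - a1) / 3, 1 - a2
    return [[a1, 0, 0, 0, 0, 0, 0], [c, a2, 0, 0, 0, 0, 0],
            [c, 0, a2, 0, 0, 0, 0], [c, 0, 0, a2, 0, 0, 0],
            [c, d, d, 0, 1, 0, 0], [c, d, 0, d, 0, 1, 0],
            [c, 0, d, d, 0, 0, 1]]

def observed_shares(logits, M):
    """Return (p, sum(p)) for independent samples with given logit capture rates."""
    cap = [1 / (1 + math.exp(-t)) for t in logits]
    r = [math.prod(c if h else 1 - c for c, h in zip(cap, cell)) for cell in CELLS]
    p = [sum(m * x for m, x in zip(row, r)) for row in M]
    return p, sum(p)

def cond_loglik(logits, x, M):
    """Kernel of the conditional log-likelihood: sum of x * log(q)."""
    p, s = observed_shares(logits, M)
    return sum(xi * math.log(pi / s) for xi, pi in zip(x, p))

def compass_search(f, start, step=1.0, tol=1e-7):
    """Maximize f by trying +/- step along each coordinate, halving on failure."""
    best, fbest = list(start), f(start)
    while step > tol:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = best[:]
                trial[i] += delta
                ft = f(trial)
                if ft > fbest:
                    best, fbest, improved = trial, ft, True
        if not improved:
            step /= 2
    return best

def estimate_N(x, a1, a2):
    M = model3_matrix(a1, a2)
    logits = compass_search(lambda t: cond_loglik(t, x, M), [0.0, 0.0, 0.0])
    _, s = observed_shares(logits, M)
    return sum(x) / s   # N-hat such that the fitted m_ijk sum to n

# Hypothetical truth: N = 1000, capture rates 0.6, 0.5, 0.4, a1 = 0.9, a2 = 0.95.
true_logits = [math.log(c / (1 - c)) for c in (0.6, 0.5, 0.4)]
p_true, _ = observed_shares(true_logits, model3_matrix(0.9, 0.95))
x = [1000 * p for p in p_true]          # exact expected counts used as data
N_hat = estimate_N(x, 0.9, 0.95)
print(round(N_hat, 1))
```

Calling `estimate_N(x, 1.0, 1.0)` on the same counts shows what the estimate would be if the matching errors were ignored, which is a convenient way to gauge the sensitivity of the estimate to the assumed matching rates.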

As we have discussed earlier, a log-linear model is 
specified for Table 1 and the observations are viewed 
as falling into Table 2, whose parametric model of the 
expected cell counts is specified by the log-linear model 
and a chosen model for matching errors. To assess the 
appropriateness of a log-linear model specified for Table 1, 
we can apply the usual Pearson and likelihood ratio
goodness-of-fit tests, $X^2$ and $G^2$, discussed in Fienberg
(1972), to Table 2. Each statistic has an asymptotic $\chi^2$
distribution under the null hypothesis that the model fits,
with degrees of freedom equal to $2^3 - 1$ minus the number of
independent parameters in the model.
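As an illustration, both statistics can be computed directly from observed and fitted counts. The following is a minimal Python sketch rather than the authors' code; the function names are hypothetical. It uses the three-sample counts reported in Table 3 of Section 5 and the model [CP] [A] without matching errors, chosen only because its fitted values have a closed form: the cell observed in the ALS alone is fitted exactly, and the remaining 3 x 2 block is fitted by ordinary independence.

```python
import math

def pearson_x2(observed, fitted):
    """Pearson statistic: sum of (observed - fitted)^2 / fitted."""
    return sum((x - m) ** 2 / m for x, m in zip(observed, fitted))

def likelihood_ratio_g2(observed, fitted):
    """Likelihood ratio statistic: 2 * sum of observed * log(observed / fitted)."""
    return 2.0 * sum(x * math.log(x / m)
                     for x, m in zip(observed, fitted) if x > 0)

def degrees_of_freedom(n_samples, n_parameters):
    """Observed cells (2^k - 1) minus the number of independent parameters."""
    return 2 ** n_samples - 1 - n_parameters

# Table 3 counts for the three (census, P-sample) patterns other than
# (absent, absent); each pair is (ALS present, ALS absent).
block = [(300, 187), (51, 166), (53, 76)]
row_totals = [sum(row) for row in block]
col_totals = [sum(row[k] for row in block) for k in range(2)]
total = sum(row_totals)

observed = [x for row in block for x in row]
fitted = [r * c / total for r in row_totals for c in col_totals]
# The seventh observed cell (ALS only, count 180) is fitted exactly under
# [CP] [A], so it contributes nothing to either statistic.

x2 = pearson_x2(observed, fitted)
g2 = likelihood_ratio_g2(observed, fitted)
df = degrees_of_freedom(3, 5)  # assumed parameters: u, u_C, u_P, u_A, u_CP
print(f"X2 = {x2:.2f}, G2 = {g2:.2f}, d.f. = {df}")
```

Under these assumptions the Pearson statistic comes out close to the value listed for [CP] [A] in the usual MLE column of Table 5, with 2 degrees of freedom.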


5. ANALYSIS OF 1988 ST. LOUIS DRESS 
REHEARSAL CENSUS DATA 


Dual System Estimation (DSE), based on the standard
two-sample census, has been employed by the U.S. Bureau of
the Census for census coverage evaluation since 1950. In 1988,
the Census Bureau conducted a Dress Rehearsal Census
for the 1990 decennial census at three sites: St. Louis, 
Missouri; Columbia, Missouri; and western Washington 
State. Zaslavsky and Wolfgang (1993) present data for a 
population subgroup from the Post Enumeration Survey 
(PES) in the dress rehearsal census in St. Louis which 
focuses on urban Black male adults who are believed to 
be underestimated by dual system methods. The resulting 
data consists of three sources: the C-sample is the census 
itself; the P-sample was compiled from the PES; a third 
source of information was the Administrative List Supple- 
ment (ALS), compiled from pre-census administrative 
records of state and federal government agencies, encom- 
passing Employment Security, driver’s license, Internal 
Revenue Service, Selective Service, and Veteran’s Admin- 
istrative records. The C-sample and P-sample provide data 
for the implementation of the usual DSE or capture-recapture
approach. The ALS data can be combined with
the Census and the P-sample for analysis from a three- 
sample perspective, though it was originally intended to 
improve the coverage of the P-sample. In Table 3, we 
present three-sample data for PES sampling stratum 11 
in St. Louis obtained by collapsing the original data in 
Table 1 of Zaslavsky and Wolfgang (1993) over four 
poststrata defined by owners/renters × age 20-29, 30-44.


Table 3
Three-Sample Data for Stratum 11, St. Louis

                Census Present            Census Absent
ALS          P-sample   P-sample       P-sample   P-sample
             Present    Absent         Present    Absent
Present        300         51             53        180
Absent         187        166             76         -


Such triple-system data can be analyzed with the 
matching error Model (2) and data from a separate 
Matching Error Study (MES, or rematch study) associated 
with the same sampling poststratum. The MES is one of 
the operations conducted by the Census Bureau to evaluate 
the PES, and typically operates for a sample of cases, using 
more extensive procedures, highly qualified personnel and 
reinterviews to obtain estimates of the bias associated with 
the previous matching process. In the discussion of the 
Matching Error Study done in a 1986 test census in Los 
Angeles, Hogan and Wolter (1988) state that “The rematch
was done independently of the original match, and the
discrepancies between the match and the rematch results
are adjudicated. Because of this intensive approach to the
rematch, we believe the rematch results represent true match
status, while differences between the match and rematch
results represent the bias in the original match results.”

Table 4
St. Louis Rematch Study: P-sample
Source: Mulry, Dajani and Biemer (1989)

                         Rematch Classification
Original Match
Classification    Matched   Not Matched   Unresolved   Total
Matched             2,667             7            8   2,682
Not matched             9           427           30     466
Unresolved              0             7           20      27

Total               2,676           441           58   3,175


The data from the MES thus provides a basis for esti- 
mating error rates in the original matching process. Mulry, 
Dajani and Biemer (1989) report the MES operation for 
the 1988 Dress Rehearsal and rematch data for all three 
test sites, and in Table 4, we reproduce those data relevant 
for our purposes. 

Let $\alpha$ be the matching rate between the C-sample and
the P-sample, and $\gamma = 1 - \alpha$ be the nonmatch error rate.
We assume no errors in the rematch. Then from the data
in Table 4, we can estimate $\alpha$ by $\hat{\alpha}$ = 2667/(2667 + 9) =
99.6637%, and $\gamma$ by $\hat{\gamma} = 1 - \hat{\alpha}$ = .3363%. The para-
meter $\theta$ is a three-sample matching rate for the C-sample,
P-sample and the ALS. It takes two matches, say, one
between the C-sample and the P-sample, and the other one
between the P-sample and the ALS, in order to reach a
correct (1,1,1) three-sample classification. In the absence
of evaluation of the match between the census and the
ALS, we assume that these two matches are independent
of each other and that the matching rate for the P-sample
and ALS is the same as that for the C-sample and the P-sample.
Thus we can use $\theta = \alpha^2$, and $\hat{\theta} = \hat{\alpha}^2$ = 99.3285%.
Based on other qualitative information, this seems to be
an unreasonably high match rate, and the match error rate
for the census and the ALS is probably higher than the 
match error rate between the census and the P-sample. In 
the absence of better quantitative information, however, 
we proceed to use it in the calculations that follow. 
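The arithmetic behind these rates can be checked with a few lines of Python. This is only a sketch using the two Table 4 counts quoted above; the variable names are hypothetical, and the squaring step encodes the stated assumption that the two matches are independent and share the same rate.

```python
# Table 4 counts: cases matched in both the original match and the rematch,
# and cases the original match missed but the rematch matched.
matched_in_both = 2667
missed_by_original = 9

alpha_hat = matched_in_both / (matched_in_both + missed_by_original)
gamma_hat = 1.0 - alpha_hat   # nonmatch error rate
theta_hat = alpha_hat ** 2    # assumes two independent matches with equal rates

print(f"alpha = {100 * alpha_hat:.4f}%")
print(f"gamma = {100 * gamma_hat:.4f}%")
print(f"theta = {100 * theta_hat:.4f}%")
```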


Table 5
Estimates Under Various Models

Log-linear              Usual MLE                   MLE Using Matching Error Model (2)
Model             N (S.E.)           Fit (d.f.)     N (S.E.)           Fit (d.f.)
[C] [P] [A]       1091.48 (11.24)    248.31 (3)     1083.58 (10.93)    244.56 (3)
[CP] [A]          1204.14 (23.31)     90.60 (2)     1194.73 (22.86)     87.30 (2)
[PA] [C]          1108.34 (13.77)    247.93 (2)     1100.03 (13.40)    244.53 (2)
[CA] [P]          1068.87 (10.47)    230.66 (2)     1061.09 (10.10)    226.42 (2)
[CP] [CA]         1271.11 (52.55)     87.16 (1)     1256.77 (50.97)     84.37 (1)
[CP] [PA]         1598.88 (106.26)    17.55 (1)     1585.03 (104.93)    15.88 (1)
[CA] [PA]         1080.47 (13.38)    230.43 (1)     1072.19 (12.88)    226.44 (1)
[CP] [CA] [PA]    2360.82 (363.25)       - (0)      2309.55 (352.36)       - (0)

Table 5 gives the estimates of the population size for
various log-linear models with estimates of standard errors 
and goodness-of-fit statistics. Standard errors are computed 
with the delta method as discussed in Fienberg (1972). The 
assumption of independence between the census and the 
P-sample has been questioned for the use of the DSE. The 
dual system method has limited capacity to test this assump- 
tion and to adjust for potential dependency, while both 
can be handled through log-linear models for three or more 
samples. There are four models listed in Table 5 that assume 
independence between the census and the P-sample: the 
independence model [C] [P] [A], [PA] [C], [CA] [P], 
and [CA] [PA]. All of them fit the data poorly. The three 
models with the interaction term for the census and the 
P-sample, [CP] [A], [CP] [CA], and [CP] [PA] fit the 
data much better. With the addition of an interaction term 
linking the census and the ALS, model [CP] [CA] fits only 
slightly better than [CP] [A], indicating that the census 
and the P-sample are together nearly independent from 
the ALS. The model [CP] [PA] fits the data the best, 
suggesting that the usual independence assumption for the 
DSE is invalid and that there is dependence between the 
P-sample and the ALS. For all seven non-saturated log- 
linear models, we obtain better fits under matching error 
Model (2), though only slightly so, due to the high match 
rate for the data from the 1988 U.S. Census Dress Rehearsal. 
For the [CP] [PA] model, there is a .8738% difference
in the estimate of N associated with the nonmatch rate of
.3363%. If the nonmatch rate had been 10%, i.e., a 90%
match rate, and assuming that the difference in the estimate
of N is approximately linear in the nonmatch rate, there
would have been a 26% difference between the usual
maximum likelihood estimate of N and our estimate.
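Two of the "usual MLE" entries in Table 5 can be reproduced by hand, because without matching errors the missing cell has a closed-form estimate for some models. The sketch below is an illustration under that assumption, not the authors' program: the function names are hypothetical, the counts are those of Table 3, and the last step simply repeats the linear extrapolation described above using the reported [CP] [PA] estimates.

```python
# Table 3 counts keyed by (census, P-sample, ALS); 1 = present, 2 = absent.
x = {
    (1, 1, 1): 300, (1, 2, 1): 51, (2, 1, 1): 53, (2, 2, 1): 180,
    (1, 1, 2): 187, (1, 2, 2): 166, (2, 1, 2): 76,
}
n = sum(x.values())  # number of people observed at least once

def n_hat_cp_pa(x):
    """[CP] [PA]: census and ALS independent given the P-sample, so the
    missing cell is estimated within the P-sample-absent layer."""
    m222 = x[(1, 2, 2)] * x[(2, 2, 1)] / x[(1, 2, 1)]
    return sum(x.values()) + m222

def n_hat_cp_a(x):
    """[CP] [A]: the (census, P-sample) pattern is independent of the ALS."""
    als_present = x[(1, 1, 1)] + x[(1, 2, 1)] + x[(2, 1, 1)]
    als_absent = x[(1, 1, 2)] + x[(1, 2, 2)] + x[(2, 1, 2)]
    m222 = x[(2, 2, 1)] * als_absent / als_present
    return sum(x.values()) + m222

n_cp_pa = n_hat_cp_pa(x)
n_cp_a = n_hat_cp_a(x)
print(f"[CP] [PA]: {n_cp_pa:.2f}   [CP] [A]: {n_cp_a:.2f}")

# Linear extrapolation of the matching-error effect, using the two
# [CP] [PA] estimates reported in Table 5.
usual_mle, matching_mle = 1598.88, 1585.03
diff_pct = 100 * (usual_mle - matching_mle) / matching_mle
extrapolated_pct = diff_pct * 10.0 / 0.3363  # hypothetical 10% nonmatch rate
print(f"difference: {diff_pct:.4f}%, extrapolated: {extrapolated_pct:.1f}%")
```

The matching error column of Table 5 has no comparably simple form here and would require the constrained maximization of Section 4.2.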


Table 6
Dual-System Data for Stratum 11, St. Louis

                      Census
P-sample     Present    Absent    Total
Present        487        129      616
Absent         217         -
Total          704


Table 6 presents the usual dual system data for stratum 11,
St. Louis. The number of people in both the census and
the P-sample is $y_{11} = 487$, the number of those in the
census only is $y_{12} = 217$, and the number in the P-sample only
is $y_{21} = 129$. The total census count is $y_{1+} = y_{11} + y_{12} = 704$,
the total P-sample count is $y_{+1} = y_{11} + y_{21} = 616$,
the dual system estimate is $\widehat{DSE} = y_{1+} y_{+1}/y_{11} = 893$
(p. 232, Bishop, Fienberg and Holland 1975), and the esti-
mated variance of DSE is $\widehat{Var}(\widehat{DSE}) = y_{1+} y_{+1} y_{12} y_{21}/y_{11}^3 = 105.4$
(p. 233, Bishop et al. 1975). The standard error is
SE(DSE) = 10.27.
The census undercount for the population estimate
DSE is $(\widehat{DSE} - y_{1+})/\widehat{DSE} \times 100\%$ = 21.17%.
For our best fitting model, the census undercount is
$(\hat{N} - y_{1+})/\hat{N}$ = 55.97% for the estimate $\hat{N}$ = 1599
assuming no matching error and 55.58% for $\hat{N}$ = 1585
from matching error Model (2). Thus there is a 55.97% −
55.58% = 0.39% upward bias from ignoring matching
errors. This is quite close to the figure of 0.37% computed 
in Ding and Fienberg (1994) for the 1986 Los Angeles test 
census data using a two-sample match rate of 99.4734%, 
as compared to 99.6637% here for the St. Louis data. Our 
estimates show that the urban Black male adults targeted 
in the St. Louis Dress Rehearsal were heavily undercounted 
by the census, and that the undercount is severely under- 
estimated by the usual dual-system or capture-recapture 
estimator of the population size. A third and qualitatively 
different sample might work well for this demographic 
group. 
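The dual system quantities and the undercount rates quoted above are simple enough to recompute. The following Python sketch applies the textbook formulas to the Table 6 counts; the helper name `undercount_pct` is hypothetical. Evaluated this way, the DSE is roughly 890 with a standard error of about 10.3, in the same range as, though not identical to, the 893 and 10.27 reported in the text; the sketch does not attempt to explain that small difference.

```python
import math

# Table 6 counts: in both lists, census only, P-sample only.
y11, y12, y21 = 487, 217, 129
y1p = y11 + y12  # total census count
yp1 = y11 + y21  # total P-sample count

dse = y1p * yp1 / y11
var_dse = y1p * yp1 * y12 * y21 / y11 ** 3
se_dse = math.sqrt(var_dse)

def undercount_pct(n_hat, census_count):
    """Census undercount as a percentage of the estimated population."""
    return 100.0 * (n_hat - census_count) / n_hat

print(f"DSE = {dse:.2f}, SE = {se_dse:.2f}")
print(f"undercount, no matching error:  {undercount_pct(1599, y1p):.2f}%")
print(f"undercount, matching Model (2): {undercount_pct(1585, y1p):.2f}%")
```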

The homogeneity of the capture probabilities is one of 
the assumptions in the standard approach to the estimation 
of the size of a closed population. Darroch et al. (1993) 
developed a quasi-symmetry model and a partial quasi- 
symmetry model to allow for varying catchability of 
individuals. The quasi-symmetry model assumes that the
pattern of heterogeneity is the same for all three samples;
the partial quasi-symmetry model assumes that the pattern
of heterogeneity is the same for two samples but different
for the third sample. The latter is a sensible model given that
the third sample is qualitatively quite different from the
census and the PES, and this model is equivalent to a
combination of dependence and heterogeneity. For the
multinomial cell probabilities including the missing cell,
$\Pi = (\pi_{111}, \pi_{112}, \ldots, \pi_{222})$, both are log-linear models of
the form $\log \Pi = A\beta$ with an appropriately chosen design
matrix A and a vector of parameters $\beta$. The design matrices
for both models are given in Darroch et al. (1993).


Table 7
Heterogeneous Catchability Models

Log-Linear               MLE from Darroch et al. (1993)     MLE Using Matching Error Model (2)
Model                    N (S.E.)           Fit (d.f.)      N (S.E.)           Fit (d.f.)
Full quasi-symmetry      1923.63 (216.84)   133.54 (2)      1906.61 (213.47)   133.50 (2)
Partial quasi-symmetry   2576.54 (413.28)    11.70 (1)      2557.08 (409.39)    11.72 (1)


Our proposed method can readily incorporate heter- 
ogeneous catchability to estimate the population size by 
assuming a heterogeneity model for Table 1 and then 
adopting the conditional likelihood estimation (Sanathanan 
1972). Table 7 presents estimates from fitting the quasi-
symmetry model and the partial quasi-symmetry model for
the data from stratum 11. Again, the effect of the matching
errors in this analysis is not substantial due to the high 
matching rate. The partial quasi-symmetry model fits 
much better than the quasi-symmetry model, indicating 
there seems to be plausible heterogeneity and the pattern 
of heterogeneity seems different in the ALS. The lack of 
fit of the independence model might also be explained in 
part by the dependence among the samples (in particular 
between the census and the P-sample) and in part by 
heterogeneous catchability. 

The partial quasi-symmetry model incorporates the 
[CP] dependence and thus is an alternative to the model 
[CP] [PA] in Table 5. The two models yield similar fits 
to the data, but they give dramatically different estimates 
of N, with the model incorporating heterogeneity having 
a much larger estimate accompanied by a much larger 
estimated standard error. This suggests that there is a 
considerable instability associated with heterogeneity 
parameters and, although the two models are not nested 
and thus not directly comparable, it seems reasonable to 
opt for the smaller and more stable estimate which does 
not incorporate heterogeneity. 

Darroch et al. (1993) considered four substrata for
stratum 11 in their analysis. The two cross-classification 
variables for the four substrata O2, R2, O3 and R3 are 
whether residents owned or rented homes and whether 
they were age 20-29 or 30-44. The data for the four sub- 
strata are given in Table 8 where 1 corresponds to presence 
in a sample and 0 is for absence. We have reanalyzed them 
for comparison. Table 9 and Table 10 give estimates for 
both heterogeneity models. As pointed out earlier, the high 
match rate yields similar estimates and fits for models 
incorporating matching errors. The partial quasi-symmetry 
model shows significant improvement in fits over the full 
quasi-symmetry model with the best fits obtained for R2 
and R3. If we add the estimates of N across the four 
substrata, the total for the matching error version of 
partial quasi-symmetry is N = 2980.8, more than 16% 
larger than the estimate from the collapsed model in 
Table 7. Of course, the standard error of the estimate has 
increased by a similar magnitude. 
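The comparison between the summed substratum estimates and the collapsed estimate is straightforward arithmetic, shown here as a short Python check. The dictionary simply copies the matching error Model (2) estimates from Table 10 and the collapsed partial quasi-symmetry estimate from Table 7; the variable names are hypothetical.

```python
# Partial quasi-symmetry estimates under matching error Model (2), Table 10.
substratum_estimates = {"O2": 601.44, "R2": 646.59, "O3": 1126.90, "R3": 605.91}
collapsed_estimate = 2557.08  # Table 7, same model fitted to all of stratum 11

summed = sum(substratum_estimates.values())
increase_pct = 100 * (summed / collapsed_estimate - 1)
print(f"sum over substrata = {summed:.1f} ({increase_pct:.1f}% above collapsed)")
```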


Table 8
Three-Sample Data for Four Substrata of Stratum 11
Source: Table 2, Darroch et al. (1993)

     Sample               Substratum
C    P    A        O2     R2     O3     R3
0    0    1        59     43     35     43
0    1    0         8     34     10     24
0    1    1        19     11     10     13
1    0    0        31     41     62     32
1    0    1        19      4     13      7
1    1    0        13     69     36     69
1    1    1        79     58     91     72

Table 9
Estimates for Full Quasi-Symmetry

              MLE from Darroch et al. (1993)    MLE Using Matching Error Model (2)
Substratum    N (S.E.)          Fit (d.f.)      N (S.E.)          Fit (d.f.)
O2            780.83 (294.81)   11.70 (2)       777.98 (293.99)   11.69 (2)
R2            394.34 (56.45)    41.09 (2)       391.14 (55.29)    41.02 (2)
O3            765.45 (254.57)   25.99 (2)       759.97 (252.44)   25.98 (2)
R3            361.83 (47.33)    59.31 (2)       358.71 (46.20)    59.22 (2)

Table 10
Estimates for Partial Quasi-Symmetry

              MLE from Darroch et al. (1993)    MLE Using Matching Error Model (2)
Substratum    N (S.E.)           Fit (d.f.)     N (S.E.)           Fit (d.f.)
O2            605.66 (212.63)    7.51 (1)       601.44 (210.93)    7.52 (1)
R2            652.34 (205.12)    0.04 (1)       646.59 (202.58)    0.04 (1)
O3            1124.00 (473.26)   8.27 (1)       1126.90 (476.54)   8.22 (1)
R3            611.78 (200.82)    2.92 (1)       605.91 (198.26)    2.92 (1)


6. SUMMARY 


In this paper, we have presented models for matching 
errors and models for the estimation of the population 
total and census undercount in a multiple sample census. 
We have illustrated our methods by reanalyzing census 
coverage data from the 1988 St. Louis Dress Rehearsal 
census. Two sources of information are considered in our 
analysis, the data from a Matching Error Study (MES), 
and triple-system data with every individual cross-classified 
according to presence or absence in each of three samples: 
the census, a post enumeration survey (P-sample) and an 
administrative list supplement. We imbed the standard 
log-linear model formulation of Fienberg (1972) into our 
estimation procedure to account for statistical dependency 
together with matching errors and to allow for formal 
goodness-of-fit test of various models. Our method applies 
to any model of a log-linear form and we have illustrated 
how heterogeneity models can be incorporated into our 
approach to allow for both matching errors and heter- 
ogeneous catchability. 

Our matching error models assume that false matches 
are negligible. Sensitivity analysis in Ding (1990) shows 
that when both the false nonmatch rate and the false match 
rate are the same order of magnitude, the matching bias is 
dominated by the false nonmatch rate (see also Fay, Passel, 
Robinson and Cowan 1988, p. 53). This is because the 
capture probabilities in the census and the post enumeration
survey are high, and thus a comparable change in both the
false nonmatch and false match rates has substantially 
more impact on false nonmatches than false matches. For 
the 1986 Los Angeles test census data, the estimates of 
false nonmatch rate and false match rate computed in 
Ding and Fienberg (1994) are about 0.5% and 0.8%, 
respectively. Based on these empirical findings, we have 
some reason to believe that, at least in the census applica- 
tion described here, our models for false nonmatch errors 
are reasonable approximations to reality. 

We have analyzed the St. Louis triple-system data with 
an estimate of the matching rate taken from the MES. 
Matching rates may not be homogeneous over different 
population strata, and we suggest that the MES data 
associated with the same sampling stratum be used. We
have developed the formulation in §3 for the k-sample census,
and our approach can be readily applied to a k-sample
census with k ≥ 4.


REFERENCES 


BISHOP, Y.M.M., FIENBERG, S.E., and HOLLAND, P.W. 
(1975). Discrete Multivariate Analysis: Theory and Practice. 
Cambridge, MA: M.I.T. Press. 


CHEN, T.T. (1979). Log-linear models for categorical data with 
misclassification and double sampling. Journal of American 
Statistical Association, 74, 481-488. 


CORMACK, R.M. (1968). The statistics of capture-recapture
methods. Oceanography and Marine Biology: An Annual Review,
6, 455-506.


DARROCH, J.N. (1958). The multiple-recapture census, I: 
estimation of a closed population. Biometrika, 45, 343-359. 


DARROCH, J.N., FIENBERG, S.E., GLONEK, G.F.V., and 
JUNKER, B.W. (1993). A three-sample multiple-recapture 
approach to census population estimation with heterogeneous 
catchability. Journal of American Statistical Association, 88, 
1137-1148. 


DING, Y. (1990). Capture-recapture census with uncertain 
matching. Ph.D. dissertation, Department of Statistics, 
Carnegie Mellon University, Pittsburgh, Pennsylvania. 


DING, Y., and FIENBERG, S.E. (1994). Dual system estimation 
of census undercount in the presence of matching error. 
Survey Methodology, 20, 149-158. 


FAY, R.E., PASSEL, J.S., ROBINSON, J.G., and COWAN, C.D. 
(1988). The coverage of population in the 1980 census. Bureau 
of the Census, U.S. Department of Commerce. 


FIENBERG, S.E. (1972). The multiple recapture census for
closed populations and incomplete 2^k contingency tables.
Biometrika, 59, 591-603.


HOGAN, H., and WOLTER, K. (1988). Measuring accuracy in 
a Post-Enumeration Survey. Survey Methodology, 14, 99-116. 


MULRY, M.H., DAJANI, A., and BIEMER, P. (1989). The 
Matching Error Study for the 1988 Dress Rehearsal. Proceedings 
of the Section on Survey Research Methods, American 
Statistical Association, 704-709. 




RAO, C.R. (1957). Maximum likelihood estimation for the
multinomial distribution. Sankhya, 18, 139-148.


SANATHANAN, L. (1972). Estimating the size of a multi-
nomial population. Annals of Mathematical Statistics, 43,
142-152.


SEBER, G.A.F. (1982). The Estimation of Animal Abundance
and Related Parameters. New York: MacMillan.


ZASLAVSKY, A.M., and WOLFGANG, G.S. (1993). Triple
System Modeling of Census, Post-Enumeration Survey and
Administrative List Data. Journal of Business and Economic
Statistics, 11, 279-288.


Survey Methodology, June 1996 
Vol. 22, No. 1, pp. 65-75 
Statistics Canada 




Applying the Lavallée and Hidiroglou Method to 
Obtain Stratification Boundaries for the Census Bureau’s 
Annual Capital Expenditures Survey 


JOHN G. SLANTA and THOMAS R. KRENZKE¹


ABSTRACT 


The Lavallée-Hidiroglou (L-H) method of finding stratification boundaries has been used in the Census Bureau’s 
Annual Capital Expenditures Survey (ACES) to stratify part of its universe in the pilot study and the subsequent 
preliminary survey. This iterative method minimizes the sample size while fixing the desired reliability level by 
constructing appropriate boundary points. However, we encountered two problems in our application. One problem 
was that different starting boundaries resulted in different ending boundaries. The other problem was that the 
convergence to locally-optimal boundaries was slow, i.e., the number of iterations was large and convergence was 
not guaranteed. This paper addresses our difficulties with the L-H method and shows how they were resolved so 
that this procedure would work well for the ACES. In particular, we describe how contour plots were constructed 
and used to help illustrate how insignificant these problems were once the L-H method was applied. This paper 
describes revisions made to the L-H method, revisions that made it a practical method of finding stratification
boundaries for ACES.


KEY WORDS: Convergence; Contour plots; Economic surveys. 


1. INTRODUCTION 


The primary objectives of the sample design of the Census 
Bureau’s Annual Capital Expenditures Survey (ACES) 
are to meet desired reliability levels using operationally- 
feasible methodology and to stay within budget limita- 
tions. To achieve these goals, we implemented a stratified 
simple random sample design using a modified version of 
Lavallée and Hidiroglou’s (L-H) (1988) approach of 
finding stratum bounds. This stratification method for 
skewed populations obtains optimal boundary points by 
minimizing the total sample size given a desired coefficient 
of variation (c.v.). Survey managers associated with a 
single-purpose survey having access to a single stratifier 
can benefit from its operational ease and cost reductions. 

We considered several papers that documented other 
methods for finding size stratum boundaries. Hess, Sethi, 
and Balakrishnan (1966) compared several stratifying 
techniques. The popular Dalenius and Hodges method 
(Cochran 1977, p. 129) was considered easy to implement 
in our case but was initially ruled out because it was not 
designed with certainty strata in mind. Sethi’s method 
(1963) of using standard distributions was not used because 
we thought it would be cumbersome to identify the distri- 
bution and sub-optimal to use standard distributions for 
each of the 80 ACES industries. Eckman’s rule (1959) of 
equalizing the product of stratum weights and stratum 
range seemed to require rather ominous calculations. 


The L-H method was the most appealing for our appli-
cation. Designed specifically for skewed populations,
which are common in economic surveys, it creates a
boundary that defines the take-all stratum, and the optimal
boundary point(s) for the take-some strata. It will sometimes
create additional take-all strata if, under Neyman
allocation, the stratum sample size is greater than or equal
to the stratum size.

The L-H method is an iterative algorithm that begins
by computing or arbitrarily setting the initial
stratum boundaries. Then stratum statistics, such as the
stratum size, mean, and variance, are computed. These
statistics are entered into boundary formulas that were
derived by minimizing the sample size subject to a desired
c.v. If the new boundaries have not converged, the stratum
statistics are recalculated for the newly defined size strata.
The cycle continues until the boundaries converge.

Schneeberger (1979) discussed the problem of finding
optimal stratification boundaries. He shows that when
this problem is expressed as a non-linear program and
solved by a gradient method, the solution may be a
relative or global minimum, a maximum, or a saddle
point of the variance of the sample mean. Detlefsen and
Veum (1991) document this as a shortcoming of the L-H 
method when testing its application for the Census Bureau’s 
Monthly Retail Trade Survey. In the L-H method, they 
found that many times the resulting boundaries differed 
substantially from where the initial boundaries were set,
so the minimum sample size attained was a local minimum.
Geometrically, the sample size as a function of two stratum
boundaries looks like a landscape with one or more
bowl-shaped valleys. The L-H method begins in a region
and descends until it reaches the lowest point. If more than 
one minimum exists, it will not continue to search for the 
global minimum. Therefore, one objective is to have initial 
boundaries that are in the neighborhood of the global 
minimum. Using starting boundaries resulting from a 
technique such as the Dalenius and Hodges method may 
help satisfy this desire. 

Detlefsen and Veum (1991) also noted instances of 
slow or non-convergence. However, they also noted that 
convergence occurred faster when the number of strata 
was reduced and when starting boundaries were the same 
as the previous survey’s sample selection boundaries. In 
order to defend ourselves against infinite loops in the 
computer program or a large number of iterations, we 
decided on doing two things. First, we implemented a 
sample design in which the L-H method would create sets 
of only three size strata. Second, we decided to implement 
stopping rules so that when the convergence rate appeared 
to slow down, the program stopped processing. 
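The kind of safeguard described here can be sketched as a generic iteration wrapper. The Python below is an illustration only: the L-H boundary formulas are not reproduced, so `update` and `sample_size` are hypothetical stand-ins supplied by the caller, and the thresholds `max_iter`, `tol` and `min_gain` are assumed tuning values rather than the rules used for the ACES.

```python
def iterate_with_stopping_rules(update, sample_size, start,
                                max_iter=50, tol=1e-6, min_gain=0.5):
    """Repeat a boundary update until one of three stopping rules fires.

    update      -- maps current boundaries to new boundaries (stand-in for
                   the L-H boundary formulas)
    sample_size -- maps boundaries to the resulting sample size
    Returns (boundaries, sample size, iterations used, reason for stopping).
    """
    bounds = tuple(start)
    n = sample_size(bounds)
    for iteration in range(1, max_iter + 1):
        new_bounds = tuple(update(bounds))
        new_n = sample_size(new_bounds)
        moved = max(abs(u - v) for u, v in zip(new_bounds, bounds))
        gain = n - new_n  # reduction in sample size from this iteration
        bounds, n = new_bounds, new_n
        if moved < tol:
            return bounds, n, iteration, "converged"
        if gain < min_gain:
            return bounds, n, iteration, "gain too small"
    return bounds, n, max_iter, "iteration limit"

# Toy example: each update moves halfway toward (2, 5), and the "sample
# size" is 100 plus ten times the squared distance from that point.
target = (2.0, 5.0)
toy_update = lambda b: tuple((u + t) / 2 for u, t in zip(b, target))
toy_size = lambda b: 100 + 10 * sum((u - t) ** 2 for u, t in zip(b, target))

bounds, n, iterations, reason = iterate_with_stopping_rules(
    toy_update, toy_size, start=(0.0, 10.0))
print(bounds, round(n, 4), iterations, reason)
```

In the toy run the loop stops on the minimum-gain rule well before the boundaries stop moving, which mirrors the observation that most of the reduction in sample size occurs in the first few iterations.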

In this work, we give background information on the 
ACES and briefly describe the way the L-H method was 
applied. We show how contour plots and three-dimensional 
plots gave us justification for using the L-H method to get 
the final boundaries. We show how the contour plots 
address the convergence problem by showing how constraints
can be set up to be met after each iteration. This
would protect us against slow or non-convergence under 
the assumption that the marginal gain achieved is not 
worth the extra effort. 


2. ACES BACKGROUND 


The 1992 ACES was designed by the Census Bureau to 
be a large-scale operational test of the sampling, processing, 
programming, data entry, editing, and estimation procedures 
which extended beyond a 1991 pilot study, to prepare for 
the 1993 full-scale survey. Capital expenditure estimates 
for domestic activities were published at conglomerated 
industry levels from the 1992 survey. In addition, the 1991 
and 1992 preliminary surveys provided valuable capital 
expenditure data that will be used in future sample design 
enhancements. 

The sampling unit for the ACES was the company 
which may be comprised of several establishments. The 
sampled population included all active companies with five 
or more employees from all major industry sectors except 
Government. These sectors include mining, construction, 
manufacturing, transportation, wholesale and retail trade, 
finance, services, and a portion of the agriculture sector 
that includes agricultural services, forestry, fishing,
hunting, and trapping. Only companies with domestic
activity were included in the sampling frame. The Research 
and Methodology Staff of the Census Bureau’s Industry 
Division constructed the sampling frame, selected the 
sample, and generated estimates. 

The ACES sampling frame was constructed from the 
Census Bureau’s Standard Statistical Establishment List 
(SSEL) in November 1992 using final 1991 data for single 
unit (SU) establishments and 1990 data for establishments 
associated with multiunit (MU) firms. Major exclusions
from the frame were public administration, the U.S. Postal
Service, international establishments, and establishments in
Puerto Rico, Guam, the Virgin Islands, and the Mariana
Islands. EI Submasters (SU records on the SSEL that are
associated with MU establishments), establishments
associated with agricultural production, and private
households were also excluded from the frame.

The establishment-based file was consolidated into a 
company-based file. In addition, the 4-digit Standard 
Industrial Classification (SIC) codes for each company 
were recoded into ACES categories. The 80 ACES cate- 
gories consisted of either 3-digit SICs or combinations of 
3-digit SICs. The ACES sampling frame included approx- 
imately two million companies. 


3. THE L-H METHOD APPLIED TO THE ACES 


The universe of companies was classified into two 
major strata. Stratum I was an arbitrarily defined take-all 
stratum that consisted of large companies with more than 
500 employees and over $100 million in assets. Stratum I 
companies were not classified into one ACES industry. For 
the estimated industry level payroll totals used in the calcu- 
lation of the industry-level sample sizes, stratum I companies 
could contribute to more than one ACES industry depending 
on the number of different ACES industries the companies 
have payroll in, identified in the SSEL. 

Stratum II contained companies that had five or more
employees and fewer than 500 employees. Stratum II
companies were classified into one industry, even if engaged 
in more than one activity. Each company had frame infor- 
mation available for each of the ACES industries the 
company had activity in. However, the company’s payroll 
contributed only to estimated total payroll for the industry 
that the company was classified in. Subsequently, within 
stratum II, for each ACES industry category, three size 
strata were created based on total company annual payroll 
using the L-H method. 

A concern with the sample design is the result of 
companies being misclassified due to the measure of size 
being used. We classified each stratum II company into 
its highest payroll industry; however, companies self-report 
their capital expenditures into ACES industries on the 
ACES questionnaire. Companies may report in multiple
industries. If too many companies self-report into indus-
tries other than where they were classified, then control 
on the reliability of the estimates is lost. 

A similar concern is that the variation in payroll is not 
the same as the variation in expenditures. Since sample size 
is directly related to the variance, sample sizes may be 
different than what is really required. Therefore, since the 
correlation between payroll and expenditures is not high, 
the chances that reliability constraints will be met will 
diminish. 

The application of the L-H method to the ACES 1992 
preliminary survey sample design involved splitting 
stratum II into one take-all size stratum and two take- 
some size strata for each ACES industry. The boundaries 
were derived for each industry by taking the partial deri- 
vative of the sample size with respect to a boundary while 
fixing the other boundary. However, in practice, we allowed 
both boundaries to move simultaneously. This results in 
an iterative process of minimizing the sample size for each 
industry subject to c.v. constraints. Within stratum II for 
each ACES industry and assuming Neyman Allocation 
(Detlefsen and Veum 1991), the sample size equation that 
is minimized is, 


$$ n = N_3 + \frac{N \left( \sum_{j=1}^{2} W_j S_j \right)^2}{cv^2 \, Y^2 / N + \sum_{j=1}^{2} W_j S_j^2} \qquad (1) $$

where $N_3$ is the number of companies in the take-all size
stratum within stratum II defined by the L-H method, N
is the number of stratum II companies in the ACES industry
of interest, $W_j = N_j/N$ is the stratum proportion, $N_j$ is
the number of stratum II companies for size stratum j, cv
is the desired coefficient of variation for the ACES industry
of interest, Y is the total payroll for stratum I and II for
the ACES industry of interest defined by

$$ Y = \sum_{k=1}^{N_I} y_k + \sum_{j=1}^{3} \sum_{i=1}^{N_j} y_{ji}, $$


N_ is the number companies in stratum I, and S; is the 
standard deviation of payroll from the SSEL for size 
stratum / in stratum II defined by, 


where, yj; is the payroll value of company / of size stratum 
J for the ACES industry of interest, and Y; is the mean 
of payroll for size stratum /. 
The reliability level for each industry was an expected
c.v. of 5% on payroll. It was not known, however, what 
standard errors would result for capital expenditures, as 
no capital expenditures data exist for the frame records. 
Companies responding in ACES industries different from 
the ones they contributed to in the sample design also 
caused the c.v.’s to fluctuate. The total number of com- 
panies selected for the ACES 1992 preliminary survey was 
11,194, consisting of 1,500 stratum I companies and 9,694 
stratum II companies. 


4. CONVERGENCE INTO NEIGHBORHOODS 


One of the problems with the L-H method is that it 
sometimes takes a large number of iterations before the 
boundaries converge; sometimes they never converge. 
Generally after just a few iterations, a large proportion of 
the improvement in the sample size has already occurred. 
Our goal was to be able to implement stopping rules so that 
when an area around a local minimum is reached, we can 
stop processing. This prompted our use of contour plots 
in analyzing the effect the boundaries have on the resulting 
sample size. It also allowed us to get a graphical view of 
the neighborhoods around the local minima. We will use 
two distributions to illustrate the benefits of reviewing con- 
tour plots. 


4.1 Non-Skewed Distribution 


The first example is a non-skewed distribution from
Schneeberger (1979). This distribution is symmetric about
x = 1, as shown in Figure 1.

Schneeberger’s objective was to find boundaries for
three take-some strata using a gradient method. Using the
objective function $z = (\sum_h W_h \sigma_h)^2$, the results obtained
are listed in Table 1.


Table 1
Optimum Boundaries for Non-Skewed Distribution

            b_1        b_2     Optimum Point
(2a)     .50241    1.03985     Minimum
(2b)     .70910    1.29090     Saddle Point
(2c)     .96015    1.49759     Minimum

Source: Schneeberger (1979).
Table 2
L-H Boundaries for Three Take-Some Strata for Non-Skewed Distribution

                                                    Iteration
                                                    Within 5% of
                              1st Iteration         Sample Size
    N   Starting Method      b_1    b_2       n     b_1
   50   N_1 = N_2 = N_3      .59   1.41   10.89     .66
  100   N_1 = N_2 = N_3      .59   1.41   12.60     .66
  200   N_1 = N_2 = N_3      .59   1.41   13.42     .66
 1000   N_1 = N_2 = N_3      .59   1.41   13.85     .66
 5000   N_1 = N_2 = N_3      .59   1.41   14.12     .66
   50   Dalenius-Hodges      .70   1.40   10.09     .70
  100   Dalenius-Hodges      .70   1.40   10.90     .84
  200   Dalenius-Hodges      .70   1.40   11.42     .83
 1000   Dalenius-Hodges      .70   1.40   11.86     .86
 5000   Dalenius-Hodges      .70   1.40   11.95     .86
   50   Off Line             .50   1.30   10.87     [?]
  100   Off Line             .50   1.30   11.95     [?]
  200   Off Line             .50   1.30   12.64     .56
 1000   Off Line             .50   1.30   13.24     .56
 5000   Off Line             .50   1.30     [?]     .56

[?] marks an entry that is illegible in the source.


We generated five datasets of different sizes
(N = 50, 100, 200, 1000, and 5000) using the formula
$F(x_j) = (j - 1/2)/N$, $j = 1, \ldots, N$. For this example, we adapted the
L-H method to construct three take-some strata and no
take-all stratum in order to compare our results with the
results in the Schneeberger paper. In keeping with our application
of estimating totals, we minimized the sample size
subject to a c.v. of 0.05 and ran the L-H method for each of
the five population sizes using three different starting
techniques. The results are given in Table 2.

There are three main points from the information in
Table 2. First, the algorithm’s convergence depends on the
population size. The underlying theory of the L-H method
is based on continuous distributions, whereas our examples and
any survey application have discrete data from finite
populations. It is also apparent that as N gets larger, the resulting
boundaries get closer to where the minimum is under an
infinite population size. Figure 2 shows the roughness of
the sample size surface when N is small (i.e., N = 50).
The resulting surface illustrates the saddle in three
dimensions. In this graph, the axes are the lower and
upper boundaries and the surface is the resulting sample
size. The graph shows the saddle point and the two local
minima, and it also gives a picture of the magnitude of the
sample size reductions that result from shifting the boundaries.
In contrast, Figure 3 shows the smoothness of the surface
when N is large (i.e., N = 5000). From this, it seems that
the roughness of the sample size surface, and consequently
the population size, has an effect on where the boundaries
converge.

The second point of this example reemphasizes that 
the ending boundaries are dependent on the starting 


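Table 2 (continued)

                             Iteration Within 5%
                             of Sample Size              Final Iteration
    N   Starting Method      b_2       n   iter.#     b_1    b_2       n   iter.#
   50   N_1 = N_2 = N_3     1.34    9.98        3     .70    [?]     [?]        4
  100   N_1 = N_2 = N_3     1.34   10.91        2     .70   1.30   10.55        5
  200   N_1 = N_2 = N_3     1.34   11.43        2     [?]   1.29   10.99        6
 1000   N_1 = N_2 = N_3     1.34   11.75        2     [?]   1.29     [?]        7
 5000   N_1 = N_2 = N_3     1.34   11.84        2     .71   1.29   11.45      [?]
   50   Dalenius-Hodges     1.40   10.09        1     [?]    [?]    9.63        4
  100   Dalenius-Hodges     1.40   10.14      [?]     .93   1.47    9.65       13
  200   Dalenius-Hodges     1.40   10.44      [?]     [?]   1.49    9.96       17
 1000   Dalenius-Hodges     1.42   10.67        8     .96   1.50   10.27       23
 5000   Dalenius-Hodges     1.42   10.74        8     .96   1.50   10.34       28
   50   Off Line            1.20    9.43        3     .55    [?]     [?]        6
  100   Off Line            1.18   10.04        3     .53   1.07    9.65        8
  200   Off Line            1.14   10.28        4     .51   1.05    9.96       12
 1000   Off Line            1.14   10.59        4     .50   1.04   10.27       18
 5000   Off Line            1.14   10.67        4     .50   1.04   10.34       24

[?] marks an entry that is illegible in the source.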


boundaries. For this example, Schneeberger notes that
with a starting point symmetric about x = 1, where $b_1 = 1 - \lambda$
and $b_2 = 1 + \lambda$ ($0 < \lambda < 1$), which defines the line
$b_2 = 2 - b_1$, the gradient method moves
along the line $b_2 = 2 - b_1$ into the saddle point. When
we set the starting boundaries on this line, which occurred
when we started with the condition $N_1 = N_2 = N_3$, the
L-H method also converged to the saddle point (see
Table 1). With starting boundaries from the Dalenius-
Hodges method, which are off the line on the side where
$b_2 > 2 - b_1$, the L-H method converged to a minimum
(2c). The Dalenius-Hodges method works well in this
example because of the three take-some strata. With
starting boundaries which are off the line on the side
where $b_2 < 2 - b_1$ (specifically, $b_1 = .5$ and $b_2 = 1.3$), the
L-H method converges to a different minimum (2a). This
problem is not unique to the L-H method, as Schneeberger
points out that the gradient method’s resulting boundaries
are also dependent on the starting boundaries.

The third point of this example is that there seem to
be relatively large reductions in sample size in the first few
iterations, followed by several iterations with only
small reductions in sample size. Table 2 shows results
from the iteration in which the algorithm produced
a sample size within 5% of the final sample size. This
implies that the L-H algorithm quickly goes to a
neighborhood around an optimal boundary. Close to an
optimal sample size, there seems to be a wide range of
boundary points resulting in a small range of sample sizes.
The point is that stopping rules can save computing time
while not relinquishing any real reduction in sample size,
since sample size takes integer values.
Figure 1. Graph of non-skewed distribution (plot of f(x)).

Figure 2. Sample size surface for non-skewed distribution (N = 50), showing two local minima and a saddle point.

Figure 3. Sample size surface for non-skewed distribution (N = 5000), showing two local minima and a saddle point.

Figure 4. Contour plot for non-skewed distribution (N = 5000).
A contour plot of the surface shown in Figure 3 is given
in Figure 4. Again, the axes are the lower and upper
boundaries and the surface is defined by the resulting
sample size. Each line in the plot represents a sample size
value, and the space between two lines is an area that
contains a range of sample size values. For example, a solid
line represents a sample size of 11 and a series of short dash
marks represents a sample size of 13. The area between
the solid line and the line of short dash marks contains
sample sizes in the range of 11 to 13. This contour plot
shows that the improvement in the sample size becomes marginal:
once an area around the bottom of the
surface is reached, moving on is unnecessary. At this
point, most of the improvement in the sample size from
iteration to iteration is less than one. It becomes
apparent that after the first few iterations, the improvement
in the sample size from iteration to iteration diminishes
quickly. For instance, in Table 2, where N = 5000 and
the Dalenius-Hodges method was used for the
starting boundaries, the first eight iterations accounted for
74% of the total reduction in the sample size from
iteration 1 to the 28th and final iteration.


4.2 A Skewed Distribution 


Economic data are usually highly skewed and therefore 
it is more appealing to have a take-all stratum. The next 
example comes from the Pareto distribution, which is a 
very typical distribution of economic universes, where 
there are a large number of small companies and a small 
number of large companies. 
The Pareto distribution function is defined as $F(x) =
1 - 1/(1 + x)^b$, $0 < x < \infty$. From this we again
generated five datasets of different sizes using the formula
$F(x_j) = (j - 1/2)/N$. We let the values of $b$ change as
the population size changed. This was done so as to keep 
the upper tail of the finite discrete distribution roughly the 
same proportion to the entire population for each popula- 
tion size. To do so, the parameter b was chosen in such 
a way that about 90% of the total sum could be accounted 
for in the top 20% of all possible sampling units. Since the 
datasets contain a finite number of discrete values there 
was no problem deriving variances of different strata when 
values of b were less than 2. 
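
The construction of such a dataset can be sketched by inverting the Pareto distribution function at the points (j - 1/2)/N. The text does not give the values of b that were used. The value b = 0.8 below is an assumption: with N = 50 it reproduces the data values 9.71, 11.81, 14.79 and 19.29 that are quoted for this example later in the section, and it places roughly 90% of the total in the top 20% of units. The function names are hypothetical.

```python
def pareto_population(n_units, b):
    """Solve F(x_j) = (j - 1/2)/N for x_j, with F(x) = 1 - 1/(1 + x)**b."""
    return [
        (1.0 - (j - 0.5) / n_units) ** (-1.0 / b) - 1.0
        for j in range(1, n_units + 1)
    ]


def top_share(values, fraction=0.20):
    """Share of the total held by the largest `fraction` of the units."""
    ordered = sorted(values, reverse=True)
    n_top = round(fraction * len(ordered))
    return sum(ordered[:n_top]) / sum(ordered)


x = pareto_population(50, 0.8)  # b = 0.8 is an inferred, not a stated, value
print([round(v, 2) for v in x[42:46]])  # x_43 .. x_46
print(round(top_share(x), 2))
```

For other population sizes, b would be tuned until `top_share` is again close to 0.90, which is the criterion described above.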

Table 3 gives the L-H results for different population
sizes and starting points. The first group uses starting values
which yield equal stratum populations ($N_1 = N_2 = N_3$).
The second group uses the Dalenius-Hodges method to
obtain all initial boundaries. The third group obtains
starting boundaries by first using a method for
determining the take-all boundary as presented by Hidiroglou
(1986) and uses the Dalenius-Hodges method for the other
boundary. Again it can be observed that the sample size
surface, as a function of the strata boundaries, is much choppier for
smaller population sizes (see Figure 5). For example, when
N = 50 and $b_1$ is fixed, there was only one sample size
when $b_2$ varied between 11.8 and 14.7. This is because
there were no values within this range in the population.
As the population size increases, the data values are closer
together, and the sample size surface becomes very smooth
(see Figure 6).
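
The flat stretch of the surface can be illustrated with a small sketch. It is built on several assumptions: a Pareto population with N = 50 and an inferred b = 0.8 (not stated in the text), a sample size of the Neyman-allocation form n = n_3 + (sum of N_j S_j)^2 / (cv^2 Y^2 + sum of N_j S_j^2), and an arbitrary fixed lower boundary of 2.0. It does not attempt to reproduce the figures in Table 3.

```python
from statistics import stdev


def sample_size(values, b1, b2, cv=0.05):
    # Units above b2 are take-all; the two take-some strata lie below it.
    strata = [[v for v in values if v <= b1],
              [v for v in values if b1 < v <= b2]]
    n_take_all = sum(1 for v in values if v > b2)
    ns = sum(len(s) * stdev(s) for s in strata if len(s) > 1)
    ns2 = sum(len(s) * stdev(s) ** 2 for s in strata if len(s) > 1)
    return n_take_all + ns ** 2 / (cv ** 2 * sum(values) ** 2 + ns2)


N, b = 50, 0.8  # b = 0.8 is an inferred value, not one given in the text
x = [(1 - (j - 0.5) / N) ** (-1 / b) - 1 for j in range(1, N + 1)]

lower = 2.0  # arbitrary fixed lower boundary
for upper in (12.0, 13.0, 14.5, 15.0):
    print(upper, round(sample_size(x, lower, upper), 2))
```

The first three upper boundaries assign every unit to the same stratum, so they give one and the same sample size; only when the boundary crosses a data value does the surface move.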
Table 3
L-H Boundaries for Skewed Distribution (one take-all stratum, two take-some strata)

Figure 5. Sample size surface for skewed distribution (N = 50).

Figure 6. Sample size surface for skewed distribution (N = 5000).

Figure 7. Contour plot of skewed distribution (N = 50).

Figure 8. Contour plot of skewed distribution (N = 5000).
The contour plot for N = 50 (Figure 7) has erratic
shapes defined by straight lines for contour markings. The 
contour plot for N = 5000 (Figure 8) has almost smooth 
concentric ellipses for contour markings. It would appear 
to be a desirable quality for the contour markings to be 
the same shape and concentric. This would imply that the 
global minimum is the only local minimum. 

The contour plot for N = 50 demonstrates a case
where the L-H method did not converge to the optimal
boundaries. Since, for this example, we let the L-H program run
until it converged, the question may arise as to why the L-H
method did not converge to the optimal boundaries. The
easiest way to explain this is by viewing Figure 5. We can
see that when the population size is small, the sample
size surface is not as smooth as in Figure 6. We see several
major ridges in Figure 5 that are caused by wide gaps
in the skewed discrete data ($x_{43} = 9.71$, $x_{44} = 11.81$,
$x_{45} = 14.79$, $x_{46} = 19.29$). This means that for a given
$b_1$, any value of $b_2$ between 11.81 and 14.79 would yield
the same sample size. When we ran the L-H program for
starting boundaries other than the three listed in
Table 3, we obtained the final boundaries in Table 3
along with other boundaries and their corresponding
sample sizes. It appears that the L-H method converges
to a low region on one of the major ridges, provided that
the region is in the neighborhood of the optimal
boundaries. The minimum sample size is 9.22 and the L-H
method in Table 3 yielded a sample size of 9.36. The
smallest integer sample size for each result that
meets or exceeds the constraint is 10. Here again we see
that the L-H method performs exceptionally well even with
discrete distributions that have small population sizes, as
the boundaries converge within the
neighborhood containing the optimal solution.

Another observation is that there is
a broad range of values that the boundaries can take on
while keeping the integer value of the sample size the same.
As the size of the neighborhood expands, the range of
boundary values extends as well. It should also be pointed
out that even though the range of $b_1$ values for a given
neighborhood is smaller than the range of values for $b_2$,
there are far more sampling units in the range of $b_1$ than
in the range of $b_2$ because of the skewed distribution.


5. SUMMARY 


The graphs presented here have shown that a wide range
of boundary values results in a small range of sample sizes
when in a neighborhood around an optimal value (the
bowl-shaped bottom of the graphs). Any further
improvement in the sample size, i.e., a small marginal
gain, might not be worth the extra effort to obtain. This
marginal gain may not even improve the sample
size, since the sample size is really an integer and the
marginal gain might only be a small fraction. The L-H
method proved very effective in obtaining boundary
values in a desired neighborhood around an optimal value,
and did so relatively fast.

By measuring the rate of convergence using the sample 
size instead of boundary values we were better able to 
determine when a desired neighborhood around an optimal 
value was reached. This is because boundary values vary 
greatly in such a neighborhood while sample size (which 
is of main interest) varies slightly. When the improvement 
in sample size from iteration to iteration was marginal or 
nonexistent we immediately terminated the program under 
the assumption that we reached the desired neighborhood. 
The following stopping rules are recommended. Stop 
processing when: 


1) the difference between the new upper boundary and the 
previous iteration’s upper boundary is less than one. 
The whole number, one, is used in our case since payroll 
values are only available to us in whole number values 
and any shifting of boundaries of a value less than one 
does not affect any companies; 


2) the difference between the new lower boundary and the 
previous iteration’s lower boundary is less than one; 


3) the difference between the new sample size and the 
previous iteration’s sample size is less than a small 
arbitrary value. We recommend a number less than one 
since sample sizes are usually rounded up and any 
fractional improvement on the sample size is negligible. 
One should be careful when choosing this value since 
it is possible that the sample size reduction rate may 
increase from iteration to iteration because the slope of 
the surface changes; 


4) the program goes into the 30th iteration. Of course, this 
is an arbitrary value and may depend on the number of 
times (industries) one has to apply the L-H method. 
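
The four rules above can be sketched as a single check applied after each iteration. The text does not say how the rules combine, so the combination used here is an assumption: stop when both boundaries have settled (rules 1 and 2 together), or when the sample size has settled (rule 3), or when the iteration limit is reached (rule 4). The function name, the tuple layout and the default sample size tolerance of 0.5 are hypothetical.

```python
def should_stop(previous, current, iteration,
                boundary_tol=1.0, size_tol=0.5, max_iterations=30):
    """previous/current: (lower boundary, upper boundary, sample size)."""
    upper_settled = abs(current[1] - previous[1]) < boundary_tol  # rule 1
    lower_settled = abs(current[0] - previous[0]) < boundary_tol  # rule 2
    size_settled = abs(current[2] - previous[2]) < size_tol       # rule 3
    out_of_budget = iteration >= max_iterations                   # rule 4
    # Assumed combination: rules 1 and 2 jointly, or rule 3, or rule 4.
    return (upper_settled and lower_settled) or size_settled or out_of_budget


# Boundaries still moving and sample size still dropping: keep iterating.
print(should_stop((100, 900, 45.2), (140, 950, 43.0), iteration=5))
# Sample size changed by only 0.2: stop.
print(should_stop((100, 900, 45.2), (140, 950, 45.0), iteration=5))
```

As rule 3 cautions, a tolerance on the sample size alone can stop too early if the surface steepens again, so the value of `size_tol` deserves some care.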


Another note is that small population sizes may cause
convergence of the boundaries to a suboptimal point, as
shown in the examples. Graphs of the sample size surface
show a rough surface for small populations and a smooth
surface for large populations. It is this rough surface, due
to the discrete nature of the small population, that
contributes, in part, to where the L-H method converges.

A final point concerns our application. The
Dalenius-Hodges method assumes that all resulting strata
will be sampled, whereas the L-H method is written to construct an
analytical take-all substratum. Therefore, the top stratum
developed by the Dalenius-Hodges method, when creating
the initial boundaries for ACES industries, will be
top-heavy since it will not be sampled. Improvements in the
sample size were noticed from the Dalenius-Hodges method
to the first iteration of the L-H method in this situation.
The risk is that the starting boundaries may
lead to a local minimum that is not the best solution.
A New Method to Reduce Unwanted Ripples and Revisions
in Trend-Cycle Estimates From X-11-ARIMA 


ESTELA BEE DAGUM¹


ABSTRACT 


The estimation of the trend-cycle with the X-11-ARIMA method is often done using the 13-term Henderson filter
applied to seasonally adjusted data modified by extreme values. This filter, however, produces a large number of
unwanted ripples in the final or "historical" trend-cycle curve which are interpreted as false turning points. The
use of a longer Henderson filter such as the 23-term is not an alternative, for this filter is slow to detect turning
points and consequently is not useful for current economic and business analysis. This paper proposes a new method
that enables the use of the 13-term Henderson filter with the advantages of: (i) reducing the number of unwanted
ripples; (ii) reducing the size of the revisions to preliminary values; and (iii) not increasing the time lag to detect turning
points. The results are illustrated with nine leading indicator series of the Canadian Composite Leading Index.


KEY WORDS: Trend-cycle; X-11-ARIMA; Turning points; Leading economic indicators. 


1. INTRODUCTION 


The estimation of the trend-cycle with the X-11-ARIMA
seasonal adjustment method (Dagum 1980, 1988), as well
as the U.S. Bureau of the Census X-11 variant (Shiskin,
Young and Musgrave 1967), is done by the application of
linear filters due to Henderson (1916). These Henderson
filters are applied to seasonally adjusted series where the
irregulars have been modified to take into account the
presence of extreme values. The length of the filters is
automatically selected on the basis of specific values of
noise to signal ratios (I/S), the 13-term filter being the
one most commonly chosen.

The problem of trend-cycle estimation has attracted the 
attention of several authors, among others, Rhoades (1980); 
Cholette (1981, 1982); Kenny and Durbin (1982); Castles 
(1987); Dagum and Laniel (1987); Cleveland, Cleveland, 
McRae and Terpenning (1990); Wallgren and Wallgren 
(1990); Gray and Thomson (1990); Findley and Monsell 
(1990); Scott (1990); and Kenny (1993). Nevertheless, most
statistical agencies (except the Australian Bureau of
Statistics) concentrate their publications on seasonally
adjusted series and only very few provide some sort of
information on the trend-cycle, usually in the form of graphs.

There are several reasons for limiting the publication
of trend-cycle estimates. In the majority of cases, the
seasonally adjusted data are already smooth enough
to provide a clear signal of the short-term trend.
But for highly volatile series where further smoothing is
required, the main objections to trend-cycle estimation
are: (1) the size of the revisions of the most recent values
(generally much larger than for the corresponding seasonally
adjusted estimates) and (2) the presence of short cycles or
ripples (9- and 10-month cycles) in the final trend-cycle
curve when the 13-term Henderson filter is applied. In this
regard, Kenny (1993) has argued that the presence of
ripples in the final estimates of the trend-cycle leads to a
large number of false turning points, making the 13-term
filter unsuitable for monitoring turning points. He has
proposed the use of the 23-term Henderson filter with the
object of obtaining a much smoother trend. However, it
is well known that this longer filter is slow to detect
turning points and, hence, not useful for current economic
and business analysis. From this latter viewpoint, the 13-term
filter is preferable, but it produces ripples which can be
interpreted as false turning points (an unwanted property).

The main purpose of this study is to introduce a new 
method by which the 13-term Henderson filter can be used 
with the advantages of: (1) reducing the number of un- 
wanted ripples, (2) reducing the size of the revisions made 
to the most recent estimates when new observations are 
added to the series, and (3) not increasing the time lag to 
detect turning points. 


2. TREND-CYCLE CASCADE FILTERS 


The 13-term Henderson filter is the one most often selected
and, combined with the standard seasonal filters (5- and
7-term moving averages), produces a symmetric cascade
filter for final or central values (at least four years from
each end of the series) with a gain as exhibited in Figure 1.

Figure 1 also shows the gain functions of other filter 
convolutions, namely: (1) short seasonal filters with the 
9-term Henderson filter and (2) long seasonal filters with 
the 23-term Henderson filter. It is apparent that cycles of 
9 and 10 months (in the 0.08-0.16 frequency band) will not 
be suppressed by any of the cascade filters, particularly, 
Figure 1. Trend-cycle symmetric cascade filters.

Figure 2. Trend-cycle concurrent cascade filters. Standard
seasonal m.a. combined with three Henderson filters.


those using the 9- and 13-term Henderson filters. In fact, 
the symmetric trend-cycle cascade filter that results from 
the 9-term Henderson passes about 90% of the power of 
these short cycles; 72% and 21% are passed by the 13- and 
23-term Henderson filters, respectively. 
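
For readers who wish to experiment, the gain of a symmetric Henderson filter taken on its own can be computed from the standard closed-form expression for the Henderson weights. The sketch below covers the Henderson filter alone; the percentages quoted above refer to cascade filters that also include the seasonal moving averages, so they are not expected to match its output. The function names are hypothetical, and frequencies are taken to be in cycles per month.

```python
import math


def henderson_weights(length):
    """Symmetric Henderson weights for an odd filter length (e.g. 9, 13, 23)."""
    m = (length - 1) // 2
    n = m + 2
    denom = (8 * n * (n**2 - 1) * (4 * n**2 - 1)
             * (4 * n**2 - 9) * (4 * n**2 - 25))
    return [
        315 * ((n - 1)**2 - j**2) * (n**2 - j**2) * ((n + 1)**2 - j**2)
        * (3 * n**2 - 16 - 11 * j**2) / denom
        for j in range(-m, m + 1)
    ]


def gain(weights, cycles_per_month):
    """Gain of a symmetric filter at the given frequency."""
    m = (len(weights) - 1) // 2
    return abs(sum(w * math.cos(2 * math.pi * cycles_per_month * (i - m))
                   for i, w in enumerate(weights)))


h13 = henderson_weights(13)
for period in (9, 10):  # cycle lengths in months
    print(period, round(gain(h13, 1 / period), 3))
```

Because the weights are symmetric, the filter introduces no phase shift and the gain reduces to a cosine sum; the asymmetric end-point filters discussed next do not share this property.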

For the concurrent trend-cycle filters, which are applied
to the last available observation, the peak reached in the
frequency band corresponding to 9- and 10-month cycles
is even larger (see Figure 2). Furthermore, all these
asymmetric filters introduce phase shift: close to two
months for the 23-term filter (the largest), one month for the
13-term filter, and one-half month for the 9-term filter.


Figure 3. Trend-cycle concurrent cascade filters, (3 x 3)(3 x 5)
[H-13], with and without ARIMA extrapolations. (Axis label: phase shift, in months.)


Figure 3 shows how the use of ARIMA extrapolations
makes the gain of the concurrent cascade filter (using the
13-term Henderson) resemble the symmetric one, although
at the expense of a small increase in phase shift. The
extrapolations are from an ARIMA model (0,1,1)(0,1,1)
where the regular moving average parameter is θ = 0.40
and the seasonal moving average parameter is Θ = 0.60.

Although not shown for space reasons, the gain and
phase shift of this trend-cycle concurrent filter fall between
the other two combinations.

When ARIMA extrapolations are used, the gain of the 
concurrent filter converges very fast to that of the final. 
Dagum and Laniel (1987) show that after three more 
observations are added to the series, the gain of the asym- 
metric trend-cycle filter is very close to the symmetric one. 
The properties of these filters are also extensively discussed 
in Dagum, Chhab and Chiu (1993, 1996). 
Ripples in the final trend-cycle estimates
will be produced by the 13-term Henderson filter only if
some power is present in the input to the filter in the
0.08-0.16 frequency band. The input to the filter is the
seasonally adjusted data with extreme values replaced.

In most empirical cases, the presence of unwanted 
ripples occurs in periods of high volatility when the 
observed data are mostly influenced by outliers which can 
be falsely interpreted as turning points. Although the 
seasonally adjusted series are modified by extreme values, 
there is a need for further smoothing which can be done 
either by applying a longer Henderson filter or by being 
stricter with the replacement of outliers. Since we want to 
keep the advantage of a short filter to detect turning points 
faster, the latter approach is the one followed here. 

In the current procedure, the default sigma limits for
the replacement of extreme values are ±1.5 sigma and
±2.5 sigma. Values beyond ±2.5 sigma receive a zero
weight and those within ±1.5 sigma a weight of one
(full weight). Values falling between the two limits are
assigned a linearly graduated weight between zero and one.
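
The weighting scheme just described can be written as a short function. This is a sketch of the rule as stated in the text, not the X-11-ARIMA source code; it assumes that the irregular is expressed as a deviation from its mean and that sigma is supplied by the caller. The function name is hypothetical.

```python
def extreme_value_weight(irregular, sigma, lower=1.5, upper=2.5):
    """Weight given to an irregular value under sigma limits (lower, upper)."""
    z = abs(irregular) / sigma
    if z <= lower:
        return 1.0  # within the lower limit: full weight
    if z >= upper:
        return 0.0  # beyond the upper limit: treated as extreme
    return (upper - z) / (upper - lower)  # linear graduation in between


for value in (1.0, 2.0, 3.0):
    print(value, extreme_value_weight(value, sigma=1.0))
```

Passing `lower=0.7, upper=1.0` gives the stricter limits used by the method proposed in the next section.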


3. A NEW METHOD


The new method proposed here basically consists of:
(1) extending a smoothed seasonally adjusted series 
(modified by extreme values with zero weight) with ARIMA 
extrapolations, and (2) applying the 13-term Henderson 
filter to the extended series using stricter sigma limits for 
the identification and replacement of extreme values. 

Experimentation with real data showed that the power
spectrum of the seasonally adjusted series in the 0.08-0.16
frequency band was drastically reduced only when strict
sigma limits such as ±0.7 sigma and ±1.0 sigma were
used. Hence, when applying the 13-term Henderson filter,
the trend-cycle curve did not exhibit unwanted ripples
while still maintaining its good property of rapid detection
of turning points. Under the assumption of normality,
these new sigma limits imply that 48% of the irregulars will
be modified: 32% will get zero weight and will be replaced
by the mean value, and 16% will get graduated weights
from zero to one.
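
These percentages can be checked directly from the standard normal distribution. The short sketch below computes the two-sided tail probabilities at 0.7 and 1.0 sigma; they come to roughly 48%, 32% and 17%, the quoted 16% being the difference of the two rounded figures.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))


modified = two_sided_tail(0.7)        # irregulars beyond 0.7 sigma
zero_weight = two_sided_tail(1.0)     # irregulars beyond 1.0 sigma
graduated = modified - zero_weight    # irregulars between the two limits
print(f"{modified:.1%} {zero_weight:.1%} {graduated:.1%}")
```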

The extension of the smoothed seasonally adjusted 
series with ARIMA extrapolations is needed to reduce the 
size of the revisions for the most recent estimates of the 
trend-cycle. 

The implementation of this new procedure in the context 
of the X-11-ARIMA and X-11 methods must be done in 
two steps as follows: 


(1) Produce the best seasonally adjusted series selecting 
appropriate options for the estimation of the compo- 
nents, that is, seasonality, trend-cycle, trading-day varia- 
tions and Easter effects plus permanent or temporary 
priors, if applicable. The seasonally adjusted values are
printed in Table D11. The seasonally adjusted series 
is modified by extreme values with zero weights using 
the default sigma limits and printed in Table E2. When 
the estimates of the published seasonally adjusted 
series for the current year are modified according to 
some revision practices, then this published revised 
series should be resubmitted to the X-11-ARIMA 
program to obtain the corresponding output shown 
in Table E2. 


(2) The output from Table E2 is extended with one year 
of extrapolations from an ARIMA model. The ARIMA 
model found adequate with many real series is the (0,1,1) 
(0,0,1) model. Although the output from Table E2 
does not contain seasonality, the seasonal moving 
average parameter (often of very small value) is needed 
to correct for some sort of seasonal autocorrelation in 
the data. The extended series is then run with the 
X-11-ARIMA program using the Summary Measures 
option and requesting strict sigma limits (±0.70 and
±1.00) and the 13-term Henderson filter. The new
trend-cycle estimates are printed in Table D12. 


4. EMPIRICAL RESULTS 


The new method for trend-cycle estimation is tested
with nine leading indicator series of the Canadian
Composite Leading Index. In the so-called "filtered" version
of the Canadian Composite Leading Index published by
Statistics Canada, each of the component series, as well
as the Index itself, is smoothed by applying to the seasonally
adjusted data asymmetric filters based on ARMA models
developed by Rhoades (1980). The spectral properties of
these ARMA trend-cycle filters are similar to those of the
end point of the 9-, 13- and 23-term Henderson filters,
depending on the ARMA model chosen (see Cholette
1982). (Although a comparison with the ARMA filters is
not done in this paper, it is likely that the new approach
will also give improved results.) Most of the series are
highly volatile and all lead at turning points in the business
cycle. The series are:

TSE300 Stock Price Index (TSE300)
House Spending Index (HSI)
Money Supply (M1)
Business and Personal Services Employment (BPSE)
Average Workweek in Manufacturing (AWM)
Retail Sales of Furniture and Appliances (RSFA)
Retail Sales of Durable Goods (RSDG)
New Orders for Durable Goods (NODG)
Shipments to Inventories Ratio (SIR).

The advantages of the new procedure versus the one currently
available in X-11-ARIMA are evaluated as follows.
4.1 Reduction of Ripples in the Final Trend-Cycle
Estimates


To calculate the reduction of ripples we first introduce
the definition of a turning point within the context of
trend-cycle data. A turning point is generally defined as
a point in time $t$ when a series, say $Y$, is larger (smaller)
than or equal to the preceding $k$ and subsequent $m$
observations of the series. That is,

$$ Y_{t-k} \le \cdots \le Y_{t-1} > Y_t \ge Y_{t+1} \ge \cdots \ge Y_{t+m} $$

defines a downturn and

$$ Y_{t-k} \ge \cdots \ge Y_{t-1} < Y_t \le Y_{t+1} \le \cdots \le Y_{t+m} $$

defines an upturn.


From the viewpoint of seasonally adjusted series and
trend-cycle data, there is no general consensus on the
values of k and m for which a turning point has occurred. Rhoades
(1980) defines a turning point for k = 1 and m = 0;
Wecker (1979) defines a turning point to be the second of
two (or more) successive declines or increases, i.e., for
k = 2 and m = 2; Zellner, Hong and Min (1991), LeSage
(1991) and Pfeffermann and Bleuer (1992) have chosen
k = 3 and m = 0. These definitions do not necessarily
correspond to those of cyclical turning points for business
cycle analysis, but any one can be useful to calculate the
number of unwanted ripples as long as two turning points
(a downturn and an upturn) occur within a period of ten
months or less. We use here the turning point definition
for which k = 3 and m = 0, given the smoothness of the
trend-cycle data.
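
The counting of ripples can be sketched as follows. The inequalities coded here are one reading of the definition, namely that a downturn at t requires a non-decreasing run over the k preceding observations, a strict fall from t - 1 to t, and a non-increasing run over the m subsequent observations, with the mirror image for an upturn. Counting a ripple for each consecutive downturn and upturn pair at most ten months apart is likewise an assumption about the convention, and the function names are hypothetical.

```python
def turning_points(y, k=3, m=0):
    """Return (index, kind) pairs for the turning points of the series y."""
    points = []
    for t in range(k, len(y) - m):
        before = y[t - k:t]      # Y_{t-k}, ..., Y_{t-1}
        after = y[t:t + m + 1]   # Y_t, ..., Y_{t+m}
        rising_before = all(a <= b for a, b in zip(before, before[1:]))
        falling_before = all(a >= b for a, b in zip(before, before[1:]))
        falling_after = all(a >= b for a, b in zip(after, after[1:]))
        rising_after = all(a <= b for a, b in zip(after, after[1:]))
        if rising_before and y[t - 1] > y[t] and falling_after:
            points.append((t, "downturn"))
        elif falling_before and y[t - 1] < y[t] and rising_after:
            points.append((t, "upturn"))
    return points


def count_ripples(points, max_gap=10):
    """Count consecutive turning points of opposite kind within max_gap."""
    return sum(
        1
        for (t0, kind0), (t1, kind1) in zip(points, points[1:])
        if kind0 != kind1 and t1 - t0 <= max_gap
    )


series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 5]  # toy monthly values
found = turning_points(series)
print(found, count_ripples(found))
```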

Table 1 shows the number of ripples present in the
trend-cycle estimates from the standard and the modified
13-term Henderson filter for the period January 1981-
December 1993.


Table 1
Number of Unwanted Ripples in the Trend-Cycle Data
Using the 13-Term Henderson Filter for the Period 1981-1993

Series     Standard Procedure     Modified Procedure
NODG               9                      2
HSI                8                      4
RSDG               8                      4
BPSE               8                      5
AWM                7                      1
SIR                5                      1
TSE300             4                      2
M1                 4                      2
RSFA               4                      0


The results show that the reduction is larger for those 
series with a large number of ripples and significant in 
all cases. 


Figure 4. Average work week manufacturing. (Series plotted, 1981 to 1993: seasonally adjusted, trend-cycle standard H13, trend-cycle modified H13.)

Figure 5. New orders for durable goods.


For illustrative purposes, Figures 4 and 5, for AWM and
NODG respectively, exhibit the seasonally adjusted values
and the trend-cycle data of both the standard and modified
procedures. It is apparent that the new method reduces the
ripples in the trend-cycle data with respect to those shown
by the standard procedure. In fact, the modified
trend-cycle data resemble those of the 23-term Henderson filter
but with larger penetration into peaks and troughs of
cycles of long duration.


4.2 Turning Point Detection 


It is important that the reduction of ripples in the final 
estimates of the trend-cycle is not achieved at the expense 
of increasing the lag in detecting turning points which is 
the main limitation of the 23-term Henderson filter. 

To study the revision path of the trend-cycle for any given point in time, the estimates were computed for all end points and previous time points. The revision path of the modified trend-cycle values showed that cyclical turning points are identified with an average lag similar to that of the standard approach. Depending on the series, the lag was either equal or differed by plus or minus one month. For illustrative purposes, Figure 6a exhibits the revision path of the modified trend-cycle values of New orders for durable goods for the cyclical turning point of February 1991. Successive updates are carried out using data up to March 1991, April 1991 and so on. The turning point is recognized in April, after 2 months, whereas it takes 3 months for the standard procedure, as exhibited in Figure 6b. Furthermore, successive revisions of the trend-cycle estimates generally stay very close to the final values. The lines which protrude, indicating a large revision, can be explained in terms of the underlying data, which seem to indicate an increasing decline that is contradicted by the following values.

Figure 6a. New orders for durable goods. Trend-cycle modified H13 revisions path. (Vertical axis: millions of 1981 dollars; horizontal axis: September 1990 to July 1991.)

Figure 6b. New orders for durable goods. Trend-cycle standard H13 revisions path.

Figures 7a and 7b for the Average work week in manufacturing reveal that the turning point of February-March 1991 is detected three months later by both procedures.

Figure 7a. Average work week manufacturing. Trend-cycle modified H13 revisions path. (Vertical axis: hours; horizontal axis: October 1990 to August 1991.)

Figure 7b. Average work week manufacturing. Trend-cycle standard H13 revisions path.


4.3 Reduction of Revisions of Concurrent Trend-Cycle 
Estimates 


Another important aspect to consider is the reduction of the total revision of the most recent estimate of the trend-cycle, which is of preliminary character. Theoretically, the final trend-cycle value is obtained after the series is extended with four years of data but the size of the revisions is negligible after three more months.


Table 2 shows the mean absolute percent revision of the concurrent trend-cycle estimates over a four year period from January 1988 until December 1991. The results indicate that for six of the nine series analyzed the total revisions of the concurrent trend-cycle values using the modified procedure are smaller than those of the standard procedure (ratios between 0.73 and 0.95); for one series they are equal, and for only two series they are slightly larger.


Table 2
Mean Absolute Percent Total Revision of Concurrent Trend-Cycle Values Using the 13-Term Henderson Filter

Series    Standard Procedure (1)    Modified Procedure (2)    Ratio (2)/(1)
NODG          (illegible)                  1.10                   0.73
RSFA             0.62                      0.47                   0.76
RSDG             0.77                      0.62                   0.80
SIR              0.87                      0.70                   0.80
AWM              0.13                      0.12                   0.92
TS300         (illegible)                  1.07                   0.95
M1               0.35                      0.35                   1.00
HSI              2.09                      2.20                   1.05
BPSE             0.40                      0.42                   1.05
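A measure of this kind can be computed as sketched below. The exact definition used for Table 2 is not spelled out in the text, so the formula here (absolute difference between final and concurrent estimates, expressed as a percentage of the final estimate, averaged over the period) is an assumption, and the function name is hypothetical.

```python
def mean_abs_percent_revision(concurrent, final):
    """Mean of 100 * |final - concurrent| / |final| over all time points.

    Assumed definition: revisions are expressed relative to the final estimate.
    """
    if len(concurrent) != len(final) or not final:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((f - c) / f) for c, f in zip(concurrent, final))
    return 100.0 * total / len(final)


# Two months with revisions of 1% and 2% of the final value.
print(mean_abs_percent_revision([99.0, 102.0], [100.0, 100.0]))
```

The ratio column of Table 2 is then the value for the modified procedure divided by the value for the standard procedure.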


5. CONCLUSION 


This paper introduced a new method for trend-cycle estimation which enables the use of the 13-term Henderson filter with the advantages of: (i) reducing the number of unwanted ripples in the final trend-cycle curves, (ii) reducing the size of the revisions to preliminary concurrent values, and (iii) not increasing the time lag in turning point detection.

The new method basically consists of extending a 
smoothed seasonally adjusted series (modified by extreme 
values with zero weight) with one year of ARIMA extrap- 
olations, and then applying the 13-term Henderson filter 
using strict sigma limits for the identification and replace- 
ment of outliers. 

The procedure is illustrated with nine leading indicator 
series of the Canadian Composite Leading Index and the 
results are highly satisfactory. 
A Moving Stratification Algorithm


YVES TILLÉ


ABSTRACT 


A general algorithm with equal probabilities is presented. The author provides the second order inclusion probabilities 
that correspond to the algorithm, which generalizes the selection-rejection method, so that a sample may be drawn 
using simple random sampling without replacement. Another particular case of the algorithm, called moving 
stratification algorithm, is discussed. A smooth stratification effect can be obtained by using, as a stratification 
variable, the serial number of the observation units. The author provides approximations of first and second order 
inclusion probabilities. These approximations lead to a population mean estimator and to an estimator of the variance 
of this mean estimator. The algorithm is then compared to a classical stratified plan with proportional allocation. 


KEY WORDS: Selection algorithm; Equal probability sampling; Strata. 


1. INTRODUCTION 


When a file is ordered according to an auxiliary variable 
that is close to the variable of interest, how can a sample 
be selected using such information? One solution to the 
problem consists of making a stratified selection. However, 
making such a selection requires that a delicate problem 
be resolved, namely subdividing the population into strata. 
Another simple solution that is both quick and efficient 
consists of making a systematic selection. The algorithm 
can be written in a few lines. Moreover, the way in which 
the file is ordered can be put to good use. However, a 
systematic selection has one major flaw, namely that 
estimating the variance of total or mean estimators requires 
one or several hypotheses concerning the population. 
It will be shown that there is another simple selection 
algorithm with which a sample can be drawn in one pass 
using the file ordering system. For this algorithm, an 
estimator of the variance of a total or mean estimator is 
provided, requiring no modelling of the population. 

A general selection algorithm providing equal first 
order inclusion probabilities is presented in section 2. First 
and second order inclusion probabilities are provided. In 
section 3, the proposed algorithm is shown to generalize 
the selection-rejection method so that a simple random 
sample can be drawn without replacement along with the 
stratified plan with proportional allocation. Finally, in 
section 4, the moving stratum method is defined and, in 
section 5, conclusions are drawn. 


2. PRESENTATION OF THE GENERAL ALGORITHM 


2.1 The Algorithm 


Let us consider a finite population $U = \{1, \ldots, i, \ldots, N\}$; we write $y_1, \ldots, y_i, \ldots, y_N$ for the N values assumed by variable y for the N observation units of U. The mean of the values assumed by variable y for the population is written as

$$\bar{y} = \frac{1}{N} \sum_{i \in U} y_i.$$


A random sample s of fixed size n is drawn from this population. The random variables indicating the presence of observation units in s are written as $I_i$, $i \in U$. The first order inclusion probability is written as $\pi_i = \Pr(i \in s) = E(I_i)$, $i \in U$, and the second order inclusion probability as $\pi_{ik} = E(I_i I_k)$, $i \ne k \in U$. The algorithm is very short. It resembles the algorithms of Fan, Fuller and Rezucha (1962), Bebbington (1975), McLeod and Bellhouse (1983) and Sunter (1977, 1986). Only N, n and the $b_i$, $i = 0, \ldots, N - 1$, need to be known. The other variables are working variables.


General Algorithm
    j <= 0;
    Repeat for i = 0, ..., N - 1
        u <= a random number with a uniform distribution on [0,1];
        if u < ((b_i + i)n/N - j) / b_i, then
            select record i + 1;
            j <= j + 1;
        otherwise, pass the record i + 1;


At each step, j represents the number of records already selected and i the number of records passed (selected or not). For each iteration, a decision is made about selecting the record i + 1. If the record is selected, it becomes the (j + 1)-th in the sample. The coefficients $b_i$, $i = 0, \ldots, N - 1$, are strictly positive real numbers. These quantities must meet certain conditions discussed below if the plan is to be of fixed size or if the units are to be selected with equal probability. The choice of different values for $b_i$, $i = 0, \ldots, N - 1$, will make it possible to generate several special cases of the general algorithm.
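The one-pass procedure can be sketched in Python as follows. This is an illustrative sketch, not the author's code: it assumes that record i + 1 is selected when u < ((b_i + i)n/N - j)/b_i, and the function name `general_algorithm` is hypothetical. The comparison is multiplied through by N b_i > 0 so that the right-hand side stays an exact integer when the b_i are integers.

```python
import random


def general_algorithm(N, n, b, rng=None):
    """One pass over records 1..N; b[i] is the coefficient used at step i."""
    rng = rng or random.Random()
    sample, j = [], 0                 # j = number of records selected so far
    for i in range(N):
        u = rng.random()              # uniform on [0, 1)
        # Select record i + 1 when u < c_i = ((b_i + i) n / N - j) / b_i.
        if u * N * b[i] < (b[i] + i) * n - N * j:
            sample.append(i + 1)
            j += 1
    return sample


# Example: the coefficients b_i = N - i give the selection-rejection method.
N, n = 20, 5
b = [N - i for i in range(N)]
print(general_algorithm(N, n, b, random.Random(1)))
```

Values of c_i below 0 or above 1 need no special handling here: the record is then simply never, or always, selected.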

If the $b_i$ are strictly positive reals such that $b_i \le N - i$, then the sample size is equal to or smaller than n. In fact, assuming we have already drawn n units from the population at step i and that $b_i \le N - i$, then

$$\frac{(b_i + i)n/N - n}{b_i} = \frac{n}{N} - \frac{n}{b_i}\,\frac{N-i}{N} \le \frac{n}{N} - \frac{n}{N-i}\,\frac{N-i}{N} = 0.$$

It becomes impossible to draw a further unit. It will be assumed in everything that follows that $b_i \le N - i$. Moreover, if $b_i \le N - i$, $i = 1, \ldots, N - n - 1$, and if $b_i = N - i$, $i = N - n, \ldots, N - 1$, the sample is of fixed size n. Note that these conditions for obtaining a sample of fixed size are sufficient but not necessary.
Three particular cases of the algorithm are examined 
below. These three cases are defined by three choices of 
coefficient b;,i = 0, ..., N — 1. Before examining these 
particular choices, we will determine the first and second 
order inclusion probabilities without loss of generality. 


2.2 First Order Inclusion Probabilities 


We write $n_i$ for the number of units selected after passing i records. We see immediately that $n_1, \ldots, n_i, \ldots, n_N$ is a Markov chain. In fact, we directly derive from the algorithm that

$$\Pr[n_i = j \mid n_1, \ldots, n_{i-1}] = \Pr[n_i = j \mid n_{i-1}].$$

The random variables

$$c_i = \frac{(b_i + i)n/N - n_i}{b_i}, \quad i = 0, \ldots, N - 1,$$

can sometimes assume values greater than 1 or less than 0. Since $\max(0, n - N + i) \le n_i \le \min(i, n)$, then $\Pr[0 \le c_i \le 1] = 1$ if

$$b_i \ge \begin{cases} \min\left(i\,\dfrac{N-n}{n},\; N-i\right) & \text{if } n \le N/2 \\[1ex] \min\left(i\,\dfrac{n}{N-n},\; N-i\right) & \text{if } n > N/2 \end{cases} \qquad i = 0, \ldots, N - 1. \quad (1)$$

Again conditions (1) are sufficient but not necessary. We can therefore construct $b_i$ which do not meet these conditions but which provide $c_i$ in [0,1]. The case dealt with in section 3.2 (stratification) represents one example.
The following example also provides $c_i$ in [0,1] without meeting condition (1): let us consider N = 12, n = 4 and $b_0 = b_1 = b_3 = b_4 = b_6 = 6$, $b_2 = b_5 = 7$, $b_i = N - i = 12 - i$, $i = 7, \ldots, 11$. We have $c_0 = 1/3$, $c_1 = (7 - 3n_1)/18$, $c_2 = (3 - n_2)/7$, $c_3 = (3 - n_3)/6$, $c_4 = (10 - 3n_4)/18$, $c_5 = (4 - n_5)/7$, $c_6 = (4 - n_6)/6$, $c_7 = (4 - n_7)/5$, $c_8 = (4 - n_8)/4$, $c_9 = (4 - n_9)/3$, $c_{10} = (4 - n_{10})/2$, $c_{11} = (4 - n_{11})$. We note that $n_1 \le 1$, $n_2 \le 2$, $n_3 \le 3$ and if $n_3 = 3$, then $c_3 = 0$ and therefore $n_4 \le 3$. We then have $n_5 \le 4$ and if $n_5 = 4$ then $c_5 = 0$ and therefore $n_6 \le 4$. This last comment is true for all $c_i$ that follow. We therefore note that all $c_i$ are in [0,1] whereas $b_4 = 6$ does not meet condition (1).
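The example can be checked mechanically by tracking which values of n_i are reachable and evaluating c_i exactly at each of them. The sketch below rests on two assumptions: the coefficient list is read as b_0, ..., b_11 = 6, 6, 7, 6, 6, 7, 6, 5, 4, 3, 2, 1 (the value of b_0 does not affect c_0 = n/N), and condition (1) for n <= N/2 is taken to be b_i >= min(i(N - n)/n, N - i).

```python
from fractions import Fraction

N, n = 12, 4
b = [6, 6, 7, 6, 6, 7, 6, 5, 4, 3, 2, 1]     # assumed reading of b_0, ..., b_11

reachable = {0}                 # possible values of n_i, starting from n_0 = 0
all_in_unit_interval = True
for i in range(N):
    nxt = set()
    for j in reachable:
        c = (Fraction((b[i] + i) * n, N) - j) / b[i]
        if not 0 <= c <= 1:
            all_in_unit_interval = False
        if c < 1:
            nxt.add(j)          # record i + 1 may be passed
        if c > 0:
            nxt.add(j + 1)      # record i + 1 may be selected
    reachable = nxt

# Steps at which the sufficient condition (1), case n <= N/2, fails.
violations = [i for i in range(N)
              if b[i] < min(Fraction(i * (N - n), n), N - i)]

print(all_in_unit_interval, reachable, violations)
```

Under these assumptions every reachable c_i lies in [0,1], the final sample size is always 4, and the only step failing the sufficient condition is i = 4.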

In order to simplify the demonstrations which follow, it will be assumed that

$$\Pr[0 \le c_i \le 1] = 1, \quad i = 0, \ldots, N - 1.$$

We will return to the problem of $c_i$ values greater than 1 or smaller than 0 later on. If

$$\Pr[0 \le c_i \le 1] = 1, \quad i = 0, \ldots, N - 1,$$

we have

$$E[I_{i+1} \mid n_1, \ldots, n_i] = E[I_{i+1} \mid n_i] = \frac{(b_i + i)n/N - n_i}{b_i}.$$

It can be shown easily by recursion that if $\Pr[0 \le c_i \le 1] = 1$, $i = 0, \ldots, N - 1$, then $E[n_i] = in/N$, $i = 0, \ldots, N$. Therefore,

$$\pi_i = E[I_i] = E[n_i] - E[n_{i-1}] = \frac{n}{N}. \quad (2)$$

2.3 Second Order Inclusion Probabilities 


Four results provided by lemmas 1, 2 and 3 are needed 
in order to determine second order inclusion probabilities. 


Lemma 1. If $\Pr[0 \le c_i \le 1] = 1$, $i = 0, \ldots, N - 1$, then

$$E[n_{i+k} \mid n_i] = \frac{n}{N}(i + k) + \left(n_i - \frac{in}{N}\right) \prod_{\ell=i}^{i+k-1} \frac{b_\ell - 1}{b_\ell},$$

with $i = 0, \ldots, N - 1$; $k = 1, \ldots, N - i$.


This lemma can be demonstrated by recursion if it is assumed to be true for k - 1. Using lemma 1, the following lemma is readily obtained by subtraction:

Lemma 2. If $\Pr[0 \le c_i \le 1] = 1$, $i = 0, \ldots, N - 1$, then

$$E[I_{i+k} \mid n_i] = \frac{n}{N} - \left(n_i - \frac{in}{N}\right) \frac{1}{b_{i+k-1}} \prod_{\ell=i}^{i+k-2} \frac{b_\ell - 1}{b_\ell},$$

with $i = 0, \ldots, N - 1$; $k = 1, \ldots, N - i$.


It is assumed by convention that an empty product has a value of 1.


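Lemma 3. If $\Pr[0 \le c_i \le 1] = 1$, $i = 0, \ldots, N - 1$, then

$$\mathrm{Var}[n_i] = \frac{n}{N}\,\frac{N-n}{N} \sum_{j=1}^{i} \prod_{\ell=j}^{i-1} \frac{b_\ell - 2}{b_\ell}, \quad i = 1, \ldots, N. \quad (3)$$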


The demonstration is provided in the appendix. 


Finally, the second order inclusion probability is pro- 
vided by the following proposition: 
Proposition 1. If $\Pr[0 \le c_i \le 1] = 1$, $i = 0, \ldots, N - 1$,
then 


Be NARI 


VN rey fie 


Ne eto (4) 


The demonstration is provided in the appendix. 


Corollary 1. If $\Pr[0 \le c_i \le 1] = 1$, $i = 0, \ldots, N - 1$,
then 
nm nN—=n fab oer RECTED 
Poole ara Camara b 
2 ear r 
hinds al ae 
ie bates i ied} Ncaleks Ss 
De-1 be 


2.4 The Horvitz-Thompson Estimator and its Variance 


The Horvitz-Thompson estimator is the simple sample mean since the first order inclusion probabilities are all equal:

$$\hat{\bar{y}}_{\pi} = \frac{1}{N} \sum_{i \in s} \frac{y_i}{\pi_i} = \frac{1}{n} \sum_{i \in s} y_i.$$
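If the design is of fixed size, we can use the Yates and Grundy variance formula (1953)

$$\mathrm{Var}[\hat{\bar{y}}_{\pi}] = \frac{1}{2N^2} \sum_{i \in U} \sum_{\substack{k \in U \\ k \ne i}} \left(\frac{y_i}{\pi_i} - \frac{y_k}{\pi_k}\right)^2 (\pi_i \pi_k - \pi_{ik}). \quad (5)$$

Since $\pi_i = n/N$, $i = 1, \ldots, N$, and assuming that

$$\gamma_{ik} = 1 - \pi_{ik} \frac{N^2}{n^2},$$

we can write

$$\mathrm{Var}[\hat{\bar{y}}_{\pi}] = \frac{1}{2N^2} \sum_{i \in U} \sum_{\substack{k \in U \\ k \ne i}} (y_i - y_k)^2 \gamma_{ik}. \quad (6)$$

The variance estimator is provided by

$$\widehat{\mathrm{Var}}[\hat{\bar{y}}_{\pi}] = \frac{1}{2N^2} \sum_{i \in s} \sum_{\substack{k \in s \\ k \ne i}} \left(\frac{y_i}{\pi_i} - \frac{y_k}{\pi_k}\right)^2 \frac{\pi_i \pi_k - \pi_{ik}}{\pi_{ik}}. \quad (7)$$

This can be written here as

$$\widehat{\mathrm{Var}}[\hat{\bar{y}}_{\pi}] = \frac{1}{2N^2} \sum_{i \in s} \sum_{\substack{k \in s \\ k \ne i}} (y_i - y_k)^2 \frac{\gamma_{ik}}{1 - \gamma_{ik}}.$$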


3. APPLICATION 1: SIMPLE AND STRATIFIED 
RANDOM SELECTIONS 


3.1 Simple Design 


The simplest selection algorithm, the selection-rejection method described in Fan, Fuller and Rezucha (1962, method 1), Bebbington (1975) and Deville and Grosbras (1987, p. 210), is of course a particular case of the general algorithm. We need only take

$$b_i = N - i, \quad i = 0, \ldots, N - 1.$$

We always have $0 \le c_i \le 1$. The first order inclusion probabilities always have a value of n/N. Calculations for second order inclusion probabilities follow from proposition 1. Assuming k > i, on the basis of corollary 1, we can find the second order inclusion probabilities of the simple design:

$$\pi_{ik} = \frac{n(n-1)}{N(N-1)}.$$
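For a small population these inclusion probabilities can be verified by brute force. The sketch below enumerates every selection path of the general algorithm with exact rational arithmetic; it assumes that the selection probability at step i is c_i = ((b_i + i)n/N - j)/b_i clipped to [0,1], and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def exact_inclusion_probabilities(N, n, b):
    """Enumerate all selection paths and return exact (pi_i, pi_ik) tables."""
    paths = [((), Fraction(1))]            # (selected records, path probability)
    for i in range(N):
        nxt = []
        for sel, p in paths:
            c = (Fraction((b[i] + i) * n, N) - len(sel)) / b[i]
            c = min(max(c, Fraction(0)), Fraction(1))   # clip to [0, 1]
            if c > 0:
                nxt.append((sel + (i + 1,), p * c))     # record i + 1 selected
            if c < 1:
                nxt.append((sel, p * (1 - c)))          # record i + 1 passed
        paths = nxt
    pi1 = {i: sum(p for s, p in paths if i in s) for i in range(1, N + 1)}
    pi2 = {(i, k): sum(p for s, p in paths if i in s and k in s)
           for i, k in combinations(range(1, N + 1), 2)}
    return pi1, pi2


N, n = 6, 3
pi1, pi2 = exact_inclusion_probabilities(N, n, [N - i for i in range(N)])
print(set(pi1.values()), set(pi2.values()))
```

The enumeration grows exponentially with N, so it is only meant as a check on toy cases.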
We also recall some classical results concerning the simple design that we will be using later on. The estimator for $\bar{y}$ is therefore the mean of the sample

$$\hat{\bar{y}}_{srs} = \frac{1}{n} \sum_{i \in s} y_i. \quad (8)$$

The variance of this estimator is provided by

$$\mathrm{Var}[\hat{\bar{y}}_{srs}] = \frac{\sigma_y^2}{n}\,\frac{N-n}{N-1}, \quad (9)$$

where

$$\sigma_y^2 = \frac{1}{N} \sum_{i \in U} (y_i - \bar{y})^2. \quad (10)$$

An unbiased estimate of this variance is

$$\widehat{\mathrm{Var}}[\hat{\bar{y}}_{srs}] = \frac{N-n}{Nn}\, s_y^2, \quad (11)$$

where

$$s_y^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \hat{\bar{y}}_{srs})^2. \quad (12)$$

3.2 Stratified design 


The stratified design can also be defined using the general 
algorithm. The stratification variable in this case is the serial 
number of the individual. Let us consider the particular 
case of a stratified design of H strata with proportional 
allocation where all the strata are of the same size. The 
strata are such that the individuals of a given stratum are 
adjacent in the data file. It is also assumed that N/H is an 
integer. This stratified design is obtained by simply taking

$$b_i = \frac{N}{H} - \left(i \bmod \frac{N}{H}\right), \quad i = 0, \ldots, N - 1.$$


4. APPLICATION 2: MOVING STRATIFICATION 


4.1 The Problem 


The file is assumed to be ordered according to an aux- 
iliary variable that is close to the variable of interest. The 
problem is as follows: how can we draw a random selection 
that yields a small variance for the Horvitz-Thompson 
estimator of a mean? Looking at the formulation of the 
Yates-Grundy variance (5), we see that there are two 
distinct answers to this question. 

The first solution consists of selecting with unequal probabilities using first order inclusion probabilities that are proportional to the variable of interest. If such a selection could be made, all quantities

$$\frac{y_i}{\pi_i} - \frac{y_k}{\pi_k}$$

would be zero and therefore the variance would be zero.

The second solution consists of using second order inclusion probabilities. A good selection could be one where the $\pi_{ik}$ are close to $\pi_i \pi_k$ if $y_i$ is very different from $y_k$. On the other hand, if $y_i$ is very close to $y_k$, we can select second order inclusion probabilities $\pi_{ik}$ that are clearly smaller than $\pi_i \pi_k$. Thus, where quantities

$$\left(\frac{y_i}{\pi_i} - \frac{y_k}{\pi_k}\right)^2$$

would be large (respectively small), quantities $\pi_i \pi_k - \pi_{ik}$ would be small (respectively large). We would thus have a small variance.

The second solution we have just described is in fact often 
used. It is the basic idea for stratification. Our objective 
is to apply this idea to the construction of a sequential selec- 
tion algorithm that is easy to implement. Such an algorithm 
could be applied to any file without the need to know 
anything save the size of the population. It would therefore 
apply to very large files. We could thus benefit from the 
information provided by this auxiliary variable like for strati- 
fication, without the need to actually subdivide into strata. 


4.2 The Method 


We first define M, the length of the moving stratum within the population. M represents, in a way, the size of the stratum within the population and is such that $N/n \le M \le N$. The algorithm of the moving stratum is defined by

$$b_i = \min(M, N - i), \quad i = 0, \ldots, N - 1.$$

There is, however, one problem. Quantities $c_i$ defined by

$$c_i = \begin{cases} \dfrac{(M + i)n/N - n_i}{M} & \text{if } i \le N - M \\[1ex] \dfrac{n - n_i}{N - i} & \text{otherwise,} \end{cases}$$

are not always in [0,1].


In fact, let us assume that, before the (N - M)-th step of the algorithm, $c_i$ is positive and very close to zero and that through some bad luck the unit i is nevertheless chosen. In such a case, $c_{i+1}$ would have a value of $c_i - (N - n)/(NM)$. $c_{i+1}$ can therefore have a negative value but this negative value is always greater than $-(N - n)/(NM)$. In fact, if one of the $c_i$ is already negative, the unit i is not selected and therefore $c_{i+1}$ has a value greater than $c_i$.

Let us now assume that before the (N - M)-th step of the algorithm, one $c_i$ is very slightly smaller than 1 and that nevertheless unit i is not selected. In such a case, $c_{i+1}$ would have a value of $c_i + n/(NM)$. $c_{i+1}$ can therefore take on a value greater than 1 but this value greater than 1 is nevertheless always smaller than $1 + n/(NM)$. In fact, if one of the $c_i$ is already greater than 1, the unit i is always selected and therefore $c_{i+1}$ has a value smaller than $c_i$.


We obtain

$$\Pr\left[-\frac{N-n}{NM} < c_i < 1 + \frac{n}{NM}\right] = 1, \quad i = 0, \ldots, N - M. \quad (13)$$

The design is however of fixed size, a result that follows from the following proposition:


Proposition 2. If $b_i = \min(M, N - i)$, $(N/n \le M \le N)$, $i = 1, \ldots, N - 1$, then the design is of fixed size.


The demonstration is provided in the appendix. 


Since the $c_i$ are not always within the interval [0,1], we carried out 50 simulations of the moving stratum algorithm for various sample and population sizes. The selected population sizes N were 100, 500, 2500, 12500, 62500, 312500. The reciprocals of sampling rates (N/n) were 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096. We carried out several simulations by varying the size of the moving stratum as follows: M = N/n, 2N/n, 3N/n, .... The simulations seem to indicate that the greater the value for M, the smaller the probability that a $c_i$ will fall outside of [0,1]. As soon as $M \ge 10N/n$, the problem no longer arose in any of the simulations that we carried out. This first result does not imply that the probability that at least one of the $c_i$ will fall outside of [0,1] is zero when $M \ge 10N/n$. However, it may be said that such a probability would then be very small.
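An experiment of this kind can be sketched as follows. The code is an illustration rather than the simulation program used for the paper: it assumes b_i = min(M, N - i) with integer M, keeps c_i as an exact integer ratio, and reports for each run whether some c_i left [0,1]. The function name and the parameter values in the example are hypothetical.

```python
import random


def moving_stratum_sample(N, n, M, rng):
    """Return (sample, out_of_range) for one run with b_i = min(M, N - i)."""
    sample, j, out_of_range = [], 0, False
    for i in range(N):
        b = min(M, N - i)
        num, den = (b + i) * n - N * j, N * b   # c_i = num / den, exact integers
        if num < 0 or num > den:
            out_of_range = True                 # c_i fell outside [0, 1]
        if rng.random() * den < num:            # u < c_i
            sample.append(i + 1)
            j += 1
    return sample, out_of_range


N, n = 100, 10
for M in (10, 20, 50, 100):                     # M = N/n, 2N/n, 5N/n, 10N/n
    runs = [moving_stratum_sample(N, n, M, random.Random(seed))
            for seed in range(500)]
    share = sum(flag for _, flag in runs) / len(runs)
    print(M, share, {len(s) for s, _ in runs})
```

The last column printed is the set of realized sample sizes, which should contain only n if the design is of fixed size as stated in proposition 2.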


4.3 Estimating the Mean and Bias 


In examining the results yielded by expression (2) and proposition 1, we get, as a first approximation, a value of about $\pi_i \approx n/N$ for first order inclusion probabilities. This approximation of inclusion probabilities makes it possible to construct an estimator:

$$\hat{\bar{y}}_{sm} = \frac{1}{n} \sum_{i \in s} y_i.$$

This estimator is slightly biased since the $c_i$ are not all exactly within the interval [0,1]. This bias is

$$B[\hat{\bar{y}}_{sm}] = \frac{1}{N} \sum_{i \in U} a_i y_i,$$

where $a_i = \pi_i N/n - 1$. Since the design is of fixed size, $\sum_{i \in U} a_i = 0$. We can therefore write the bias in the form of a covariance: $B[\hat{\bar{y}}_{sm}] = \sigma_{ay}$, where

$$\sigma_{ay} = \frac{1}{N} \sum_{i \in U} a_i (y_i - \bar{y}). \quad (14)$$
Since the absolute value of a covariance is always equal to or smaller than the product of the two standard deviations, we obtain an upper bound for the absolute value of the bias

$$\left| B[\hat{\bar{y}}_{sm}] \right| \le \sigma_a \sigma_y,$$

where $\sigma_y$ is defined by (10) and

$$\sigma_a^2 = \frac{1}{N} \sum_{i \in U} a_i^2.$$

The variance of the estimator is of a magnitude that is comparable (for fixed N and n) to the variance of the estimator of the mean in the simple design without replacement. We can therefore write

$$\left| B[\hat{\bar{y}}_{sm}] \right| \le C_a \sqrt{\mathrm{Var}[\hat{\bar{y}}_{srs}]},$$

where $\mathrm{Var}[\hat{\bar{y}}_{srs}]$ is defined by (9) and

$$C_a = \sigma_a \sqrt{\frac{n(N-1)}{N-n}}.$$

We will assume that the bias is negligible when the upper bound of the bias of the estimator $\hat{\bar{y}}_{sm}$ is negligible with respect to $(\mathrm{Var}[\hat{\bar{y}}_{srs}])^{1/2}$, i.e., when $C_a$ is small.


Recursively we can calculate the exact value of the $\Pr[n_i = j]$ since we have

$$\Pr[I_{i+1} = 1 \mid n_i] = \tilde{c}_i, \quad i = 1, \ldots, N - M,$$

where $\tilde{c}_i$ has a value of 0 if $c_i < 0$, $c_i$ if $0 \le c_i \le 1$ and 1 if $c_i > 1$. From this result we can derive the exact value of first order inclusion probabilities.
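This recursion is straightforward to program. The sketch below propagates the exact distribution of n_i with rational arithmetic and accumulates the probability that each record is selected; it assumes b_i = min(M, N - i) with integer M and applies the clipped probability at every step. The function name is hypothetical.

```python
from fractions import Fraction


def exact_first_order(N, n, M):
    """pi[i] = Pr(record i + 1 is selected) for b_i = min(M, N - i)."""
    dist = {0: Fraction(1)}                  # exact distribution of n_i
    pi = []
    for i in range(N):
        b = min(M, N - i)
        nxt, p_sel = {}, Fraction(0)
        for j, p in dist.items():
            c = (Fraction((b + i) * n, N) - j) / b
            c_tilde = min(max(c, Fraction(0)), Fraction(1))   # clipped c_i
            p_sel += p * c_tilde
            if c_tilde > 0:
                nxt[j + 1] = nxt.get(j + 1, 0) + p * c_tilde
            if c_tilde < 1:
                nxt[j] = nxt.get(j, 0) + p * (1 - c_tilde)
        pi.append(p_sel)
        dist = nxt
    return pi


pi = exact_first_order(12, 3, 4)
print([float(p) for p in pi])
print(sum(pi))      # equals n when the design is of fixed size
```

The work is proportional to N times the number of reachable values of n_i, so moderate population sizes are feasible, although the exact fractions become long.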

We have calculated (Appendix, Table 1) the values of $C_a$ for various sample and population (100 - 312500) sizes. The values of $C_a$ are provided for sizes of moving strata M equal to N/n, 2N/n, 3N/n, 4N/n and 5N/n. It can be seen that as soon as the size of the moving stratum is 2N/n, $C_a$ never exceeds 0.07. When M = 3N/n, the coefficient $C_a$ is expressed in thousandths. According to Cochran (1977, pp. 13-14), the bias is then negligible. The table therefore shows that if $M \ge 3N/n$, the bias of the estimator will be negligible at least for the specified sample and population sizes.

However, these results do not imply that the bias of the estimator is large when M is very small (for example M = N/n). The $C_a$ are bias upper bounds. From expression (14), we see that the bias will be all the greater as the variable of interest correlates with the exact inclusion probabilities. We have shown (Figure 1) the exact inclusion probabilities (y axis) for N individuals (x axis) obtained by using the moving stratification algorithm with the parameters N = 51, n = 11, M = N/n. This case is obviously very unfavourable. The result is interesting. In this case, n/N = 0.215686. The inclusion probabilities are distributed on both sides of n/N with no marked tendency associated with the ordering of the file. In practical terms, the probability can be considered very small that there will be a variable of interest that strongly correlates with the exact inclusion probabilities; as a result, the bias will most often be clearly smaller than the given upper bound.

Figure 1. Inclusion probabilities.

We could, of course, use the exact inclusion probabilities 
to establish an estimate. We feel that this is not worthwhile, 
for two reasons: 


• first, because calculating the exact inclusion probabilities requires a significant amount of time,


• second, because the exact first order inclusion probabilities are such that

$$\mathrm{Var}\left[\frac{1}{N} \sum_{i \in s} \frac{1}{\pi_i}\right] > 0.$$

In this case, we have a random Horvitz-Thompson estimator of a constant variable ($y_k = C$). To overcome this problem, the mean is usually estimated using Hájek's (1971) ratio. This estimator is also biased.


4.4 Estimating the Variance of the Estimator 


Assuming that $\Pr[0 \le c_i \le 1] = 1$, we can also build an approximation of second order inclusion probabilities using corollary 1. Given that $b_i$ has a value of M if $i \le N - M$ and $N - i$ otherwise, we obtain the following approximation:


where 
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N—n7n 1 Ma min(i/—1,N—M) 
Bix = 1 a 
2n M-1 M 


= ak 7) yee ei 
x 


Kak: 


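Assuming that the first order inclusion probabilities have a value of n/N, an approximation of the variance of $\hat{\bar{y}}_{sm}$ can be obtained:

$$\mathrm{Var}_{app}[\hat{\bar{y}}_{sm}] = \frac{1}{2N^2} \sum_{i \in U} \sum_{\substack{k \in U \\ k \ne i}} (y_i - y_k)^2 \theta_{ik}. \quad (15)$$

From (15), an estimator of the variance of the estimator of the mean can be obtained:

$$\widehat{\mathrm{Var}}[\hat{\bar{y}}_{sm}] = \frac{1}{2N^2} \sum_{i \in s} \sum_{\substack{k \in s \\ k \ne i}} (y_i - y_k)^2 \frac{\theta_{ik}}{1 - \theta_{ik}}. \quad (16)$$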

Again, this estimator is biased. In order to assess the 
magnitude of the bias, we carried out a series of simula- 
tions. The results are given in Table 2 in the appendix. 
We generated populations of size N = 400. The values 
assumed by the two variables x and y were generated by 
means of pseudo-random numbers having a bivariate 
normal distribution with a fixed coefficient of correlation 
ρ. The populations were then sorted in terms of the variable x. The objective was to estimate $\bar{y}$.

In these populations, samples of size 64 were selected 
using the moving stratum method (sm), a stratified design 
with proportional allocation in which the sizes of the strata 
were all equal (strat), as well as a simple design without 
replacement (srs). These three methods are particular 
cases of the general algorithm and they were implemented 
using the same random numbers. Simulations were carried 
out for different values of the moving stratum M (case: sm) 
and for different numbers of strata H (case: strat). An 
explanation is provided below for the choices of M and 
H. For each simulation, 200,000 samples were selected. 


For each of the simulations, three results are given: 


• The means for the simulations of the estimators of the variance of the estimator of the mean, which are expressed as $E_{sim}\widehat{\mathrm{Var}}(\hat{\bar{y}})$. These variance estimators are given by expressions (11) (srs) and (16) (sm).


• The mean-square errors for the simulations of the estimators of the mean. These quantities are expressed as $EQM_{sim}(\hat{\bar{y}}) = E_{sim}(\hat{\bar{y}} - \bar{y})^2$.

• The variances of the estimators of the mean. These variances are given by expressions (9) (srs) and (15) (sm). In the case of the moving stratification, this is of course the proposed approximation.
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A careful reading of the results seems to indicate that 
the variance estimator proposed for the moving stratum 
algorithm is not affected by a systematic bias no matter 
what the value for the coefficient of correlation between 
x and y. The results also seem to indicate that the approx- 
imate expression given for the variance of the estimator 
of the mean for the moving stratification is a valid 
approximation. 


4.5 Interest of the Algorithm 


Within the class of algorithms defined by the general algorithm, we call the mean horizon of an algorithm the quantity

$$\bar{b} = \frac{1}{N} \sum_{i=0}^{N-1} b_i.$$

For the simple design, we get $\bar{b}_{srs} = (N + 1)/2$. For the algorithm of the moving stratum, we have

$$\bar{b}_{sm} = \frac{1}{N} \left[ \sum_{i=0}^{N-M-1} M + \sum_{i=N-M}^{N-1} (N - i) \right].$$

Let us now assume that, as described in section 3.2, we select a sample using a stratified design with proportional allocation in which the H strata are all of the same size. In such a design, the mean horizon has a value of

$$\bar{b}_{strat} = \frac{1}{2} \left( \frac{N}{H} + 1 \right).$$

A change in the mean horizon does not fundamentally 
affect the first order inclusion probabilities. The second 
order inclusion probabilities, on the other hand, are 
strongly affected by a change of horizon. In fact, it can 
easily be seen that the smaller the mean horizon, the 
smaller the probability of selecting two close individuals. 
(Two individuals are said to be close if the absolute value 
of the difference of their serial numbers in the data file is 
small.) Intuitively, we can expect the moving stratum 
algorithm to have a stratification effect similar to that of 
a stratified design with proportional allocation having the 
same mean horizon, i.e., when 
$$\bar{b}_{strat} = \bar{b}_{sm},$$

or in other words, when

$$\frac{1}{2} \left( \frac{N}{H} + 1 \right) = \frac{1}{N} \left[ M(N - M) + \frac{M(M + 1)}{2} \right]. \quad (17)$$


When Nis large in relation to M, we have approximately 


2N 
ave 


For each series of simulations presented in the Appendix 
(Table 2), the sizes of the moving strata (case: sm) were 
fixed in terms of the number of strata (case: strat) in such 
a way that the mean horizons of the two designs were 
identical in terms of expression (17). It is observed that, 
in such a case, the increased precision (compared to that 
of the simple design) derived from the moving stratum 
algorithm is of the same order of magnitude as that derived 
by means of stratification. 


5. COMMENTS 


The simulations that were carried out clearly show that 
the moving stratification algorithm yields a stratification 
effect of the same type as classical stratification with 
proportional allocation. This algorithm makes it possible 
to study the delicate problem of subdividing a continuous 
variable into strata. The estimators of the mean that 
are proposed are slightly biased. However, as long as $M \ge 10N/n$, simulations show that it is extremely rare for at least one of the $c_i$ to fall outside of [0,1]. Moreover, we have shown that even when that probability is not zero, the bias of the estimator that we propose is negligible as long as $M \ge 3N/n$.
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APPENDIX 1 


Demonstration of the Lemmas and Propositions 
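Demonstration of Lemma 3

$$\mathrm{Var}[n_{i+1}] = \mathrm{Var}[n_i] + \mathrm{Var}[I_{i+1}] + 2\left\{E[n_i I_{i+1}] - E[n_i]\,E[I_{i+1}]\right\}.$$

Since

$$2E\left[\left(n_i - \frac{in}{N}\right)\left(I_{i+1} - \frac{n}{N}\right)\right] = 2E\left[\left(n_i - \frac{in}{N}\right)\left(\frac{(b_i + i)n/N - n_i}{b_i} - \frac{n}{N}\right)\right] = -\frac{2}{b_i}\,\mathrm{Var}[n_i],$$

we obtain

$$\mathrm{Var}[n_{i+1}] = \mathrm{Var}[n_i]\,\frac{b_i - 2}{b_i} + \frac{n}{N}\,\frac{N-n}{N}, \quad i = 1, \ldots, N - 1. \quad (18)$$

We then show that (3) verifies the recursion equation (18) and the initial condition given by

$$\mathrm{Var}(n_1) = \frac{n}{N}\,\frac{N-n}{N}.$$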

Demonstration of Proposition 1 
Case 1: 7 = 0. From lemma 2 we immediately get: 


E(i,d\) = E(EU, | mJ) 
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Case 2: i > 0. Using lemma 2, we obtain: 
Ellis etias (nj = 0] 
= Elfisg( Min. =t+ VEU 4114 = 7) 


By Oe (ESN TEE she ye GG 
= ( ae I] a 


t=i+1 


Which means that 


E(E (is ctie: | 7i)) 


1 Ee 
= fe (ima) Il a 
N NJ ORES ea 


Lemma 3 thus gives us Var[n;]. We immediately obtain 


(4). 


Demonstration of Proposition 2 


Using (13), we have

$$\Pr\left[n - M - \frac{n}{N} < n_{N-M} < n + \frac{N-n}{N}\right] = 1.$$

Therefore,

$$\Pr[0 \le n - n_{N-M} \le M] = 1.$$

Beginning with step N - M, the algorithm is a selection-rejection algorithm of the type described in section 3.1. This algorithm yields a sample of exactly $n - n_{N-M}$ observation units during the final M steps. Since $n - n_{N-M} \le M$, this operation raises no difficulty and the algorithm is therefore of fixed size n.
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APPENDIX 2 


Tables, Bias Upper Bounds and Simulations 


Table 1 
Value of the Bias Upper Bounds C_M 

                       Value of the Coefficient C_M 

       N       n    M = N/n   M = 2N/n   M = 3N/n   M = 4N/n   M = 5N/n
     100      50   0.000000   0.000000   0.000000   0.000000   0.000000
              25   0.057326   0.002610   0.000185   0.000015   0.000001
              12   0.041716   0.002604   0.000235   0.000023   0.000002
               6   0.032227   0.002029   0.000134   0.000005   0.000000
               3   0.023515   0.000645   0.000000
     500     250   0.000000   0.000000   0.000000   0.000000   0.000000
             125   0.129091   0.006002   0.000437   0.000038   0.000004
              62   0.090863   0.005664   0.000534   0.000059   0.000007
              31   0.066891   0.004666   0.000484   0.000059   0.000008
              15   0.048544   0.003586   0.000384   0.000046   0.000006
               7   0.035508   0.002552   0.000215   0.000015   0.000001
               3   0.024046   0.000699   0.000000
   2,500   1,250   0.000000   0.000000   0.000000   0.000000   0.000000
             625   0.289060   0.013495   0.000987   0.000086   0.000008
             312   0.202458   0.012607   0.001190   0.000133   0.000016
             156   0.147113   0.010234   0.001064   0.000130   0.000017
              78   0.105662   0.007742   0.000841   0.000107   0.000015
              39   0.075975   0.005719   0.000634   0.000082   0.000012
              19   0.054525   0.004174   0.000466   0.000060   0.000008
               9   0.039560   0.003014   0.000301   0.000029   0.000002
               4   0.028388   0.001451   0.000034   0.000000
  12,500   3,125   0.646539   0.030208   0.002211   0.000193   0.000018
           1,562   0.452450   0.028177   0.002661   0.000297   0.000036
             781   0.327879   0.022798   0.002371   0.000290   0.000039
             390   0.234114   0.017131   0.001863   0.000238   0.000033
             195   0.166626   0.012500   0.001388   0.000181   0.000026
              97   0.118357   0.008995   0.001009   0.000133   0.000019
              48   0.084217   0.006452   0.000727   0.000096   0.000014
              24   0.060797   0.004689   0.000529   0.000069   0.000010
              12   0.044677   0.003461   0.000377   0.000044   0.000005
                   0.033727   0.002356   0.000173   0.000008   0.000000
               3   0.024172   0.000712   0.000000
  62,500   3,906   0.732684   0.050942   0.005299   0.000649   0.000087
           1,953   0.522918   0.038250   0.004159   0.000531   0.000074
             976   0.371301   0.027833   0.003092   0.000403   0.000057
             488   0.263300   0.019979   0.002243   0.000295   0.000042
             244   0.186736   0.014259   0.001609   0.000213   0.000031
             122   0.132653   0.010168   0.001150   0.000152   0.000022
              61   0.094601   0.007273   0.000823   0.000109   0.000016
              30   0.067467   0.005207   0.000590   0.000078   0.000011
              15   0.049227   0.003820   0.000427   0.000054   0.000007
               7   0.035847   0.002637   0.000227   0.000016   0.000001
               3   0.024176   0.000713   0.000000
 312,500   4,882   0.829762   0.062191   0.006909   0.000901   0.000128
           2,441   0.587909   0.044596   0.005006   0.000659   0.000095
           1,220   0.416165   0.031758   0.003583   0.000474   0.000068
             610   0.294647   0.022555   0.002551   0.000339   0.000049
             305   0.208743   0.016008   0.001813   0.000241   0.000035
             152   0.147877   0.011356   0.001287   0.000171   0.000025
              76   0.105272   0.008098   0.000918   0.000122   0.000018
              38   0.075422   0.005817   0.000659   0.000087   0.000013
              19   0.054695   0.004238   0.000479   0.000062   0.000009
               9   0.039644   0.003038   0.000305   0.000030   0.000002
               4   0.028427   0.001457   0.000034   0.000000
Table 2 
Results of the Simulations, Simple Design, Stratification 
and Moving Stratification 

ρ     Design  Parameters        E_sim Var(y)   Var(y)    EQM_sim(y)
0.0   sm      M = 18.83N/n      0.01318        0.01317   0.01301
      srs                       0.01317        0.01316   0.01296
      strat   H = 2             0.01319        0.01319   0.01318
0.2   sm      M = 18.83N/n      0.01210        0.01210   0.01187
      srs                       0.01316        0.01316   0.01287
      strat   H = 2             0.01172        0.01188   0.01164
0.4   sm      M = 18.83N/n      0.01073        0.01073   0.01080
      srs                       0.01316        0.01316   0.01320
      strat   H = 2             0.00943        0.00929   0.00946
0.6   sm      M = 18.83N/n      0.00957        0.00957   0.00954
      srs                       0.01315        0.01316   0.01301
      strat   H = 2             0.00783        0.00778   0.00774
0.8   sm      M = 18.83N/n      0.00839        0.00839   0.00839
      srs                       0.01315        0.01316   0.01322
      strat   H = 2             0.00630        0.00624   0.00622
1.0   sm      M = 18.83N/n      0.00757        0.00757   0.00760
      srs                       0.01314        0.01316   0.01319
      strat   H = 2             0.00514        0.00508   0.00513
0.0   sm      M = 8.65N/n       0.01319        0.01319   0.01317
      srs                       0.01317        0.01316   0.01296
      strat   H = 4             0.01320        0.01318   0.01316
0.2   sm      M = 8.65N/n       0.01107        0.01107   0.01084
      srs                       0.01316        0.01316   0.01287
      strat   H = 4             0.01080        0.01076   0.01054
0.4   sm      M = 8.65N/n       0.00876        0.00876   0.00882
      srs                       0.01316        0.01316   0.01320
      strat   H = 4             0.00811        0.00793   0.00796
0.6   sm      M = 8.65N/n       0.00695        0.00694   0.00688
      srs                       0.01315        0.01316   0.01301
      strat   H = 4             0.00637        0.00639   0.00632
0.8   sm      M = 8.65N/n       0.00484        0.00484   0.00485
      srs                       0.01315        0.01316   0.01322
      strat   H = 4             0.00402        0.00391   0.00390
1.0   sm      M = 8.65N/n       0.00312        0.00312   0.00313
      srs                       0.01314        0.01316   0.01319
      strat   H = 4             0.00206        0.00197   0.00197
0.0   sm      M = 4.21N/n       0.01317        0.01317   0.01316
      srs                       0.01317        0.01316   0.01296
      strat   H = 8             0.01321        0.01324   0.01325
0.2   sm      M = 4.21N/n       0.01067        0.01067   0.01046
      srs                       0.01316        0.01316   0.01287
      strat   H = 8             0.01055        0.01047   0.01025
0.4   sm      M = 4.21N/n       0.00810        0.00809   0.00808
      srs                       0.01316        0.01316   0.01320
      strat   H = 8             0.00794        0.00789   0.00789
0.6   sm      M = 4.21N/n       0.00592        0.00592   0.00588
      srs                       0.01315        0.01316   0.01301
      strat   H = 8             0.00575        0.00564   0.00561
0.8   sm      M = 4.21N/n       0.00344        0.00344   0.00345
      srs                       0.01315        0.01316   0.01322
      strat   H = 8             0.00315        0.00311   0.00308
1.0   sm      M = 4.21N/n       0.00124        0.00124   0.00125
      srs                       0.01314        0.01316   0.01319
      strat   H = 8             0.00085        0.00079   0.00080
0.0   sm      M = 2.11N/n       0.01319        0.01319   0.01328
      srs                       0.01315        0.01316   0.01332
      strat   H = 16            0.01315        0.01308   0.01331
0.2   sm      M = 2.11N/n       0.01038        0.01036   0.01021
      srs                       0.01317        0.01316   0.01334
      strat   H = 16            0.01034        0.01034   0.01025
0.4   sm      M = 2.11N/n       0.00796        0.00796   0.00792
      srs                       0.01316        0.01316   0.01323
      strat   H = 16            0.00790        0.00801   0.00794
0.6   sm      M = 2.11N/n       0.00572        0.00573   0.00561
      srs                       0.01315        0.01316   0.01299
      strat   H = 16            0.00568        0.00572   0.00563
0.8   sm      M = 2.11N/n       0.00295        0.00294   0.00290
      srs                       0.01317        0.01316   0.01325
      strat   H = 16            0.00287        0.00288   0.00285
1.0   sm      M = 2.11N/n       0.00048        0.00048   0.00048
      srs                       0.01317        0.01316   0.01335
      strat   H = 16            0.00037        0.00034   0.00034
0.0   sm      M = 1.09N/n       0.01325        0.01316   0.01310
      srs                       0.01313        0.01316   0.01317
      strat   H = 32            0.01201        0.01239   0.01302
0.2   sm      M = 1.09N/n       0.01070        0.01062   0.01064
      srs                       0.01313        0.01316   0.01316
      strat   H = 32            0.00972        0.01018   0.01083
0.4   sm      M = 1.09N/n       0.00807        0.00803   0.00811
      srs                       0.01315        0.01316   0.01309
      strat   H = 32            0.00732        0.00751   0.00803
0.6   sm      M = 1.09N/n       0.00538        0.00534   0.00536
      srs                       0.01315        0.01316   0.01310
      strat   H = 32            0.00484        0.00484   0.00543
0.8   sm      M = 1.09N/n       0.00283        0.00281   0.00276
      srs                       0.01317        0.01316   0.01283
      strat   H = 32            0.00255        0.00276   0.00280
1.0   sm      M = 1.09N/n       0.00016        0.00016   0.00017
      srs                       0.01317        0.01316   0.01304
      strat   H = 32            0.00012        0.00007   0.00011
A View on Statistical Disclosure Control for Microdata 


A.G. de WAAL and L.C.R.J. WILLENBORG¹ 


ABSTRACT 


Problems arising from statistical disclosure control, which aims to prevent users of data from disclosing information about 
individual respondents, have come to the fore rapidly in recent years. The main reason for this is the growing demand for 
detailed data provided by statistical offices, caused by the still increasing use of computers. 
In former days tables with relatively little information were published. Nowadays the users of data demand much 
more detailed tables and, moreover, microdata to analyze by themselves. Because of this increase in information 
content statistical disclosure control has become much more difficult. In this paper the authors give their view on 
the problems which one encounters when trying to protect microdata against disclosure. This view is based on their 
experience with statistical disclosure control acquired at Statistics Netherlands. 
1. INTRODUCTION 


Statistical disclosure control (SDC) is becoming increas- 
ingly important as a result of the growing demand for 
information provided by statistical offices. The informa- 
tion released by these statistical offices can be divided into 
two major parts: tabular data and microdata. Whereas 
statistical offices have traditionally released tables, they 
have begun to release microdata sets only fairly recently. In 
the past the users of data usually did not have the tools 
to analyze these microdata sets properly themselves. 
Nowadays every serious researcher is in possession of a 
powerful personal computer. Analyzing microdata is 
therefore no longer a privilege of the statistical office. The 
users of data can and want to analyze these microdata 
themselves. This creates non-trivial SDC-problems. 

A key problem in the theory of SDC for microdata is 
the determination of the probability that a record in a 
released microdata set is re-identified. In order to estimate 
this probability a number of different approaches have 
been attempted. The aims of these attempts differ considerably. 
In some publications the aim was to gain a qualitative 
insight into the probability of re-identification of an 
unspecified record from a microdata set. In other publica- 
tions the aim was set much higher, namely to obtain the 
probability that a specific record is re-identified. These are, 
of course, extreme cases. The former case is comparatively 
easy to solve, although still difficult. The latter case is more 
difficult and may be impossible to solve. 

In this paper we give an overview of the problems for 
which Statistics Netherlands has attempted to provide a 
solution, and of problems whose suggested solutions have 
attracted our attention. We consider the problems and 
outline their solutions, while technical points are 
skipped. The choice of the problems and the possible 
solutions we consider is heavily influenced by the 
experiences of Statistics Netherlands in the field of SDC. 

The rest of this paper is organized as follows. Basic 
concepts are defined in Section 2. Preliminaries on SDC 
for microdata are the subject of Section 3. Our basic phi- 
losophy of SDC for microdata is discussed in Section 4. 
In Section 5 we describe the ideal situation for microdata: 
in this case we would have a probability for each record 
that this specific record can be re-identified. A somewhat 
less ideal situation is described in Section 6: in this case 
we have a probability for a data set that an unspecified 
record can be re-identified. In Section 7 we have to face 
reality: at the moment we do not have a good disclosure 
risk model and we have to be satisfied with heuristic 
arguments. In Section 8 we summarize our conclusions 
and suggest some possibilities for future research. 


2. BASIC CONCEPTS 


In this section a number of basic concepts are defined. 
We will assume that the statistical office wants to release 
a microdata set containing records of a sample of the 
population. Each record contains information about an 
individual entity. Such an entity could be a person, a 
household or a business enterprise. In the rest of this paper 
we will usually consider the individual entity to be a 
person, although this is not essential. 

The two most important concepts in the field of SDC 
are re-identification and disclosure. Re-identification is 
said to occur if an attacker establishes a one-to-one rela- 
tionship between a microdata record and a target indi- 
vidual with a sufficient degree of confidence. Following 
Skinner (1992) we distinguish between two kinds of disclosure. 
Re-identification disclosure occurs if the attacker is 
able to deduce the value of a sensitive variable for the target 
individual after this individual has been re-identified. 
Prediction disclosure (or attribute disclosure) occurs if the 
microdata enable the attacker to predict the value of a sen- 
sitive variable for some target individual with a sufficient 
degree of confidence. For prediction disclosure it is not 
necessary that re-identification has taken place. Most 
research so far has concentrated on re-identification 
disclosure. In this paper we will use the term disclosure to 
indicate re-identification disclosure unless stated otherwise. 

Now, let us define what is meant by an identifying 
variable. A variable is called identifying if some user of 
the data can use it, alone or in combination with other 
variables, to re-identify some respondents. Examples of 
identifying variables are residence, sex, nationality, age, 
occupation and education. A subset of the set of identifying 
variables is the set of direct (or formal) identifiers. Examples 
of direct identifiers are name, address and public 
identification numbers. Direct identifiers must be removed 
from a microdata set before it is released, for otherwise 
re-identification is very easy. Other identifiers in most 
cases do not have to be removed from the microdata set. 
A combination of identifying variables is called a key. The 
identifying variables that together constitute a key are also 
called key variables. A key value is a combination of scores 
on the identifying variables that together constitute the key. 

In practice, determining whether or not a variable is 
identifying is a problem that can only be solved by sound 
judgment. No exhaustive list of intrinsically identifying 
variables exists, nor, for that matter, does an unambiguous 
and well-defined set of rules to determine such variables. 
Selecting a set of identifying variables, and therefore of 
keys, is generally based on subjective assumptions about 
the population. Statistics Netherlands applies some criteria, 
like the visibility of the categories of a variable, to deter- 
mine whether or not a variable is identifying, but these 
criteria do not provide a definite answer to this problem 
for all variables. Whether or not a variable is considered 
identifying is essentially a matter of judgment. In the 
remainder of this paper we will assume however that a set 
of keys has been determined. 

The counterparts of identifying variables are the sensitive 
(or confidential) variables. A variable is called sensitive 
(or confidential) if some of its values represent characteristics 
that a respondent would not like to have revealed. 
In principle, Statistics Netherlands considers all variables 
sensitive, but in practice some variables are considered 
more sensitive than others. As in the case of identifying 
variables, determining whether or not a variable is sensitive 
can in practice be solved only by sound judgment. The 
variables sexual behavior and criminal past are generally 
considered sensitive, but for other variables this may 
depend on, for instance, cultural background. Keller and 
Bethlehem (1992) give as an example the variable income. 
In the Netherlands income is considered sensitive, whereas 
in Sweden it is not. Moreover, there are variables which 
should be considered both identifying and sensitive. An 
example of such a variable is ethnic membership. However, 
in the literature it is usually assumed that the identifying 
and sensitive variables can be divided into disjoint sets. In 
the remainder of this paper we will also assume that a set 
of sensitive variables has been determined which is disjoint 
from the set of identifying variables. 

By using information about the identifying variables a 
potential attacker can try to disclose information about 
sensitive variables. Note that this form of disclosure is 
possible only if the link between the values of the identifying 
variables and the values of the sensitive variables has 
not been perturbed by noise in the data or by a technique 
like data-swapping. 

To end this section, we give a definition of SDC. 
Statistical disclosure control aims to reduce to an acceptable 
level the risk that sensitive information about individual 
persons can be disclosed. What is acceptable depends on the 
policy of the data releaser. In order to reduce the risk of 
disclosure an estimate of this risk would be very helpful, 
although it is not a prerequisite 
(cf. Section 7). Some research has been devoted to defining 
and estimating this risk of disclosure. 


3. PRELIMINARIES ON SDC FOR 
MICRODATA 


As a customer of a statistical office, the user of a microdata 
set should be satisfied with its quality. The user is 
usually not interested in individual records, but only in 
statistical results which can be drawn from the total set of 
records. For instance, he wants to examine tables he has 
produced himself from the microdata set. 

Because a microdata set is meant for statistical analysis 
it is not necessary that each record in the set is correct. 
The statistical office has the possibility to perturb records, 
e.g., by adding noise or by swapping parts of records 
between different records, in order to reduce the risk of 
re-identification. By perturbing records the risk of re- 
identification is reduced because even when a correct re- 
identification takes place the information which is disclosed 
may be incorrect. In any case the attacker cannot be sure 
that the disclosed information is correct. The statistical 
office ‘only’ has to guarantee that the statistical quality 
of, for instance, the tables the user wants to examine is 
high enough. This may be quite complicated to achieve in 
practice, however. 
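
To make the idea of perturbation concrete, the sketch below shows one very simple form of data swapping. It is an illustration only: the function name, the record layout (a list of dictionaries) and the choice to permute all values of a single variable are our assumptions, and practical swapping methods usually exchange values between selected pairs of records only.

```python
import random

def swap_variable(records, variable, rng=None):
    """Data swapping (simplified): permute the values of one variable
    across records; its marginal distribution in the file is unchanged."""
    rng = rng or random.Random()
    values = [rec[variable] for rec in records]
    rng.shuffle(values)
    # Build new records so that the original file is left untouched.
    return [{**rec, variable: val} for rec, val in zip(records, values)]

records = [{"id": i, "income": inc} for i, inc in enumerate([10, 20, 30, 40])]
swapped = swap_variable(records, "income", random.Random(3))
print(sorted(rec["income"] for rec in swapped))  # [10, 20, 30, 40]
```

Univariate tables of the swapped variable are reproduced exactly, whereas tables that cross it with other variables are disturbed, which is where the loss of statistical quality mentioned above arises.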

Although data perturbation methods may prove to be 
useful, for the time being Statistics Netherlands does not 
use them. To protect its microdata sets Statistics Nether- 
lands applies local suppression and global recoding only. 
When local suppression is applied some values of variables 
in some records are set to ‘missing’, i.e., deleted from the 
microdata set. When global recoding is applied some 
variables are given a coarser categorization. In a first step, 
we try to protect a microdata set by means of global 
recoding. However, when protecting a microdata set 
entirely by means of global recodings would result in a 
considerable information loss, we apply local suppressions 
as well. In this way we try to avoid losing too much 
information. Local suppressions are applied only 
parsimoniously. 

An advantage of local suppression and global recoding 
is that these techniques preserve the integrity of the data. 
A disadvantage of local suppression is that it introduces 
a bias, because extreme values will be locally suppressed. 
However, when local suppressions are only applied parsi- 
moniously, this bias will be small. 
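
Global recoding and local suppression can be sketched as follows. This is only an illustration under assumed conventions: records are Python dictionaries, the variable names, the ten-year age bands and the frequency threshold are hypothetical, and the rule used to decide which value to suppress is a simplification rather than the procedure applied at Statistics Netherlands.

```python
from collections import Counter

def global_recode_age(records, band_width=10):
    """Global recoding: give 'age' a coarser categorization in every record."""
    recoded = []
    for rec in records:
        low = (rec["age"] // band_width) * band_width
        recoded.append({**rec, "age": f"{low}-{low + band_width - 1}"})
    return recoded

def local_suppress(records, key, variable, min_count=2):
    """Local suppression: set `variable` to missing (None) in records whose
    key value occurs fewer than `min_count` times in the file."""
    counts = Counter(tuple(rec[v] for v in key) for rec in records)
    result = []
    for rec in records:
        if counts[tuple(rec[v] for v in key)] < min_count:
            rec = {**rec, variable: None}
        result.append(rec)
    return result

records = [
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 37, "sex": "F", "occupation": "teacher"},
    {"age": 36, "sex": "F", "occupation": "mayor"},
]
key = ("age", "sex", "occupation")
step1 = global_recode_age(records)                 # first step: recode globally
step2 = local_suppress(step1, key, "occupation")   # then suppress sparingly
for rec in step2:
    print(rec)
```

In this toy file recoding alone removes the differences in age, and only the single record with a still-unique key value loses its occupation.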

From the SDC point of view a user of the data should 
also be looked upon as a potential attacker. Hence, it is 
useful to consider the ways in which disclosure can take 
place. An attacker tries to match records from the micro- 
data set with records from an identification file or with 
individuals from his circle of acquaintances. An identifica- 
tion file is a file containing records with values on direct 
identifiers and values on some other identifiers of the 
microdata set. The latter identifiers may be used to match 
records from the released microdata set with records from 
the identification file. After matching, the direct identifiers 
in the identification file can be used to determine whose 
record has been matched, and the sensitive variables in the 
released microdata set can be used to disclose information 
about this person. A circle of acquaintances is the set of 
persons in the population for whom the attacker knows 
the values on a certain key from the microdata set. So, a 
circle of acquaintances could actually be an identification 
file, and vice versa. In the rest of this paper we will 
therefore use the terms ‘identification file’ and ‘circle of 
acquaintances’ interchangeably. 
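
The matching step described above can be sketched as follows. It is a minimal illustration, assuming exact agreement on the key and no data incompatibilities between the two files; the function name, the field names and the records are invented for the example.

```python
from collections import defaultdict

def match_on_key(microdata, identification_file, key):
    """Link each identification-file record to the microdata record with the
    same key value, provided exactly one such microdata record exists."""
    by_key = defaultdict(list)
    for idx, rec in enumerate(microdata):
        by_key[tuple(rec[v] for v in key)].append(idx)
    matches = {}
    for person in identification_file:
        hits = by_key.get(tuple(person[v] for v in key), [])
        # Only a single hit gives an unambiguous candidate match.
        if len(hits) == 1:
            matches[person["name"]] = microdata[hits[0]]
    return matches

microdata = [
    {"residence": "X", "sex": "M", "occupation": "notary", "income": "high"},
    {"residence": "X", "sex": "F", "occupation": "nurse", "income": "middle"},
    {"residence": "X", "sex": "F", "occupation": "nurse", "income": "low"},
]
identification_file = [
    {"name": "Mr. A", "residence": "X", "sex": "M", "occupation": "notary"},
    {"name": "Ms. B", "residence": "X", "sex": "F", "occupation": "nurse"},
]
matches = match_on_key(microdata, identification_file,
                       ("residence", "sex", "occupation"))
print(matches)
```

A match found in this way is only a candidate: a key value that is unique in the microdata set need not be unique in the population, so the linked record may belong to someone else.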

In order for re-identification of a record of an individual 
to occur the following conditions have to be satisfied: 

C1. The individual is unique on a particular key value K. 

C2. The individual belongs to an identification file or a 
circle of acquaintances of the attacker. 

C3. The individual is an element of the sample. 

C4. The attacker knows that the record is unique in the 
population on the key K. 

C5. The attacker comes across the record in the microdata 
set. 

C6. The attacker recognizes the record of the individual. 

Whenever one of the conditions C1 to C6 does not 
hold, re-identification cannot be accomplished with absolute 
certainty. If either condition C1 or C4 does not hold, 
then a matching can be made but the attacker cannot be 
sure that this leads to a correct re-identification. 
It is clear from the conditions C1 to C6 that a ‘good’ 
model for the risk of re-identification should incorporate 
aspects of both the data set and the user. When a Dutch 
microdata set is used by someone in, say, China who is 
essentially unfamiliar with the Dutch population, then the 
risk of re-identification is negligible. In order to re-identify 
someone in a microdata set it is necessary to acquire 
sufficient knowledge about the population. The safety of 
the microdata set increases with the amount of work that 
must be done to acquire this knowledge. 


4. A PHILOSOPHY OF SDC 


It seems likely that the attention of a potential attacker 
is drawn by combinations of identifying variables that are 
rare in the sample or in the population. Combinations that 
occur quite often are less likely to trigger his curiosity. If 
he tries to match records deliberately then he will probably 
try to do this for key values that occur only a few times. 
If the user does not try to match records deliberately, but 
knows an acquaintance with a rare key value, then a 
record with that particular key value may trigger him to 
consider the possibility that this record belongs to this 
acquaintance. Moreover, the probability of a correct 
match is higher when the number of persons that score 
on the matching key value is smaller. Finally, it is also very 
likely that among the persons that score on a rare key value 
there are many uniques if the key is augmented with an 
additional variable. Records that score on such rare 
combinations of identifying variables are therefore more 
likely to be re-identified. 
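
The effect of augmenting a key with an additional variable can be illustrated by counting uniques before and after. The five-person population and the variable names below are invented for the example, and the function name is hypothetical.

```python
from collections import Counter

def count_uniques(records, key):
    """Number of records whose key value occurs exactly once."""
    counts = Counter(tuple(rec[v] for v in key) for rec in records)
    return sum(1 for c in counts.values() if c == 1)

population = [
    {"sex": "F", "occupation": "baker", "age": 31},
    {"sex": "F", "occupation": "baker", "age": 45},
    {"sex": "F", "occupation": "baker", "age": 58},
    {"sex": "M", "occupation": "clerk", "age": 40},
    {"sex": "M", "occupation": "clerk", "age": 40},
]
print(count_uniques(population, ("sex", "occupation")))         # prints 0
print(count_uniques(population, ("sex", "occupation", "age")))  # prints 3
```

Nobody is unique on the two-variable key, but adding age turns every member of the three-person group into a unique.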

In particular, key values which occur only once in the 
population, i.e., uniques in the population, can lead to 
re-identification. In the past emphasis was placed almost 
exclusively on uniqueness. It should be noted, however, 
that uniqueness is neither sufficient nor necessary for 
re-identification. If a person is unique in the population 
on certain key variables, but nobody realizes this, then this 
person may never be re-identified. If on the other hand this 
person is not unique in the population, but there is only 
one other person in the population with the same key value, 
then this other person is, in principle, able to re-identify him. 
Furthermore, suppose a person is not unique, but belongs 
to a small group of people. If the attacker 
happens to know information about him which is not 
considered to be identifying by the statistical office, but 
which is contained in the released microdata set, then it 
is quite possible that he is unique on the key combined 
with the new information. So, it is possible that a person 
is re-identified although he is not unique on the keys of 
identifying variables in the population. Finally, prediction 
disclosure may occur. That is, if a person is not unique in 
the population, but belongs to a group of people with 
(almost) the same score on a particular sensitive variable, 
then sensitive information can be disclosed about this 
individual without actual re-identification. Prediction 
disclosure is not discussed further in this paper. For more 
information on prediction disclosure we refer to Skinner 
(1992), US Department of Commerce (1978), Duncan and 
Lambert (1986), and Cox (1986). 

SDC should concentrate on key values that are rare in 
the population. A probability that information from a 
particular respondent, whose data are included in a micro- 
data set, is disclosed should reflect the ‘rareness’ of the key 
value of this respondent’s record. A probability for the 
event that information from an arbitrary respondent is 
disclosed should reflect the ‘overall rareness’ of the records 
in the data set. If there are many records in a microdata 
set of which the key value is rare, then the probability of 
disclosure for this data set should be high. In the next 
sections we will examine some attempts to incorporate 
these ideas within a mathematical framework. 


5. RE-IDENTIFICATION RISK PER RECORD 


In an ideal world (as far as SDC is concerned) a releaser 
of microdata would be able to determine a risk of re- 
identification for each record, i.e., a probability that the 
respondent of this record can be re-identified. Such a risk 
per record would enable us to adopt the following strategy. 
First, order the records according to their risk of re- 
identification with respect to a single key. Second, select 
a maximum risk the statistical office is willing to accept. 
Finally, modify all the records for which the risk of 
re-identification with respect to the key chosen is too 
high. Repeat this procedure for each key if there is more 
than one key. 
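
The strategy just described can be sketched as follows. Since no accepted risk-per-record model is available, the sketch uses a purely hypothetical stand-in for the risk, namely one divided by the number of persons in the population who share the record's key value; the function name, the population counts and the threshold are likewise assumptions made for illustration.

```python
from collections import Counter

def records_to_modify(records, key, population_counts, max_risk):
    """Return (risk, index) pairs, highest risk first, for the records whose
    risk with respect to `key` exceeds `max_risk`."""
    risks = []
    for idx, rec in enumerate(records):
        key_value = tuple(rec[v] for v in key)
        # Hypothetical risk measure: 1 / (number of persons in the
        # population sharing this key value).
        risks.append((1.0 / population_counts[key_value], idx))
    risks.sort(reverse=True)                     # step 1: order by risk
    return [(risk, idx) for risk, idx in risks   # steps 2 and 3: apply the
            if risk > max_risk]                  # accepted maximum risk

population_counts = Counter({("F", "baker"): 50, ("M", "mayor"): 1,
                             ("M", "clerk"): 4})
records = [
    {"sex": "F", "occupation": "baker"},
    {"sex": "M", "occupation": "mayor"},
    {"sex": "M", "occupation": "clerk"},
]
flagged = records_to_modify(records, ("sex", "occupation"),
                            population_counts, max_risk=0.2)
print(flagged)  # [(1.0, 1), (0.25, 2)]
```

With more than one key, the same function would be called once per key and the flagged records modified after each call.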

Unfortunately, we do not live in such an ideal world at 
the moment. However, steps towards the ideal situation 
have been made by Paass and Wauschkuhn (1985), and 
Fuller (1993). In Paass and Wauschkuhn (1985) it is 
assumed that a potential attacker has both a microdata 
file, released by a statistical office, and an identification 
file at his disposal. Between both files there may be many 
data incompatibilities. These data incompatibilities may 
be caused by e.g., coding errors, by different definitions 
of categories or by ‘noise’ in the data. By assuming a prob- 
ability distribution for these data incompatibilities and a 
disclosure scenario Paass and Wauschkuhn develop a 
sophisticated model to estimate the probability that a 
specific record from the microdata file is re-identified. The 
type of distribution of the errors that caused the data 
incompatibilities was assumed to be known to the attacker. 
The variance of the errors was assumed unknown to him. 
A potential attacker had to estimate this variance, on the 
basis of the (assumed) knowledge of the statistical production 
process. The model of Paass and Wauschkuhn is essentially 
based on discriminant analysis and cluster analysis. 


Paass and Wauschkuhn distinguish between six different 
scenarios. Each scenario corresponds to a special kind of 
attacker. The number of records in the identification file 
and the information content of the identification file depend 
on the chosen scenario. An example of such a scenario is 
the journalist scenario, where a journalist selects records 
with extreme attribute combinations in order to re-identify 
respondents with the aim of showing that the statistical office 
fails to secure the privacy of its respondents. 

Paass and Wauschkuhn apply their method to match 
records from the identification file with records from the 
microdata file. If the probability that a specific record 
from the identification file belongs to a specific record 
from the microdata set is high enough, then these two 
records are matched. This probability is the probability 
of re-identification per record, conditional on a particular 
disclosure scenario. 

Müller, Blien, Knoche, Wirth et al. (1991) and Blien, 
Wirth and Müller (1992) applied the method recommended 
in Paass and Wauschkuhn (1985) to real data. When 
compared to simple matching, i.e., a record is considered 
re-identified by an attacker if he succeeds in finding a unique 
value set in the microdata file which is identical to a value 
set in the identification file, the method suggested by Paass 
and Wauschkuhn turned out not to be superior. Apparently, 
the number of correctly matched records when applying 
the method by Paass and Wauschkuhn was in disagreement 
with the probability of re-identification per record. 

In the context of masking procedures, i.e., procedures 
for microdata disclosure limitation by adding noise to the 
microdata, Fuller (1993) obtained an expression for the 
probability that a specific record in the released microdata 
set is the same as a specific target record from an identifica- 
tion file. That is, an expression for the re-identification 
probability per record is derived. To derive this expression 
several assumptions are made. It is assumed that the data, 
the noise and errors in the data are normally distributed. 
Moreover, it is assumed that the covariance matrices of 
both the noise and the errors in the data are known to an 
attacker. Finally, it is assumed that the data have been 
obtained by simple random sampling. These assumptions 
allow Fuller (1993) to derive his expression for the 
re-identification probability by means of probability-theoretical 
considerations. Unfortunately, the approach by 
Fuller has not been tested on real data yet. Hence, it is hard 
to judge the applicability of this approach. For a comment 
on the approach by Fuller see Willenborg (1993). 

Paass and Wauschkuhn (1985), and Fuller (1993) are 
mainly interested in the effects on the disclosure risk of 
noise that has (unintentionally and intentionally, 
respectively) been added to the data. A weak point of their 
respective approaches is the implicit assumption that the key 
is a high-dimensional one. Assuming a high-dimensional key 
implies that (almost) everyone in the population is unique. 
The probability that a combination or key value occurs more 
than once in the population is negligible. This makes the 
computation of the probability of re-identification per record 
considerably easier. On the other hand, in the case of 
low-dimensional keys it is not unlikely that certain key values 
occur many times in the population. Therefore, deriving 
a probability of re-identification per record for 
low-dimensional keys is much harder than for high-dimensional 
keys, because for high-dimensional keys the probability 
of statistical twins in the population is almost zero. 

A good model for the re-identification risk per record 
does not appear to exist at the moment. In Section 6 we 
therefore consider less ambitious models, namely models 
for the re-identification risk per file. 


6. RE-IDENTIFICATION RISK PER FILE 


In a somewhat less ideal world a releaser of microdata 
would not be able to determine the risk of re-identification 
for each record, but he would be able to determine the risk 
that an unspecified record from the microdata set is re- 
identified. In this case, the statistical office should decide 
on the maximal risk it is willing to take when releasing a 
microdata set. If the actual risk is less than the maximal 
risk, then the microdata set can be released. If the actual 
risk is higher than the maximal risk, then the microdata 
set has to be modified. Determining which records have 
to be modified remains a problem, however. 

A basic model to determine the probability that an 
arbitrary record from a microdata set is re-identified has 
been proposed by Mokken, Pannekoek and Willenborg 
(1989) and Mokken, Kooiman, Pannekoek and Willenborg 
(1992). In Mokken et al. (1989) only the case where there 
is a single researcher, an unstratified population and a 
single key is considered. It has been extended to include 
the cases of subpopulations, multiple researchers and 
multiple keys (cf. Willenborg 1990a; Willenborg 1990b; 
Mokken et al. 1992). The model of Mokken et al. (1992) 
takes three probabilities into account. The first probability, 
f, is equal to the sampling fraction. In other words, f is 
the probability that a randomly chosen person from the 
population has been selected in the sample. The second 
probability, f_a, is the probability that a specific researcher 
who has access to the microdata knows the values of a 
randomly chosen person from the population on a particular 
key. The third probability, f_u, is the probability that a 
randomly chosen person from the population is unique in 
the population on a particular key. Combining these three 
probabilities, f, f_a and f_u, the probability that a record 
from a microdata set is re-identified can be evaluated. 
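
As a rough illustration of how three such probabilities might be combined, write f for the sampling fraction, f_a for the probability that the researcher knows a person's key values, and f_u for the probability that a person is unique in the population. The sketch below assumes that these three events are independent, so that a randomly chosen person can be re-identified with probability f · f_a · f_u, and that persons can be treated independently of one another. These independence assumptions, the function name and the numerical values are ours, chosen for illustration; they are not taken from Mokken et al. (1992), whose paper should be consulted for the exact form of the model.

```python
def file_level_risk(population_size, f, f_a, f_u):
    """Probability that at least one record is re-identified, assuming
    independence between the three events and between persons (an
    assumption made for this sketch only)."""
    p_person = f * f_a * f_u   # person is sampled, known and unique
    return 1.0 - (1.0 - p_person) ** population_size

risk = file_level_risk(population_size=1_000_000, f=0.01, f_a=0.0001, f_u=0.005)
print(round(risk, 4))
```

Under these assumptions the risk for the file grows with each of the three probabilities and with the size of the population, and it is zero as soon as one of the three probabilities is zero.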

For each sample element a number of variables is 
measured. The values obtained by these measurements 
(scores) are collected in records, one for each sample 
element. It is assumed that the variables in the key are 
either categorical variables or variables for which the 
measurements fall into a finite number of categories. 
Together, the records constitute a data set S that will be
made available to a researcher R. We recall that whenever
we use the term disclosure, re-identification disclosure is
meant. The model of Mokken et al. (1989, 1992)
does not take prediction disclosure into account. 

In terms of the Paass and Wauschkuhn (1985) set-up,
f_a and f_u together reflect the Informationsgehalt der
Überschneidungsmerkmale, i.e., the information content
of the matching values. The various scenarios they consider
differ in terms of f_a and f_u. In particular, f_u is influenced
by the number of variables an attacker has at his disposal
to re-identify a record and by the information content of
these variables, i.e., their categorization. The parameter
f_a is determined by the number of records that are
contained in the information file.

With respect to researcher R and key K there is a circle 
of acquaintances A. Obviously, A and its size |A| will
depend on the particular researcher R as well as on the key 
K and the variables as registered and coded in the data set. 

It is assumed that if conditions C1, C2 and C3 of the
conditions for re-identification given in Section 3 hold, then
conditions C4, C5 and C6 hold too. Condition C, is a rather
exacting one, but it can be introduced as an assumption 
for the sake of convenience in formulating a disclosure risk 
model. Note that it then yields a worst-case situation, in 
the sense that fallible perception and memory or other sources 
of ignorance, confusion and uncertainty for a potential 
discloser are excluded. Taken as an assumption together 
with C; and C, the implication is that the occurrence of 
any unique acquaintance E of R in data set S is equivalent
to re-identification by R. It is assumed that re-identification
of a record implies disclosure of confidential information.
Thus re-identification can be treated as equivalent to 
disclosure. Implicitly, it is assumed that the link between 
the identifying variables and the sensitive variables has not 
been disturbed by a technique such as data-swapping. 

Furthermore it is assumed that both the identifying and 
the confidential information are free of error or noise to 
researcher R, contrary to e.g., Paass and Wauschkuhn 
(1985), and Fuller (1993). Clearly, this assumption is 
unrealistic for most microdata sets. 

The disclosure risk D_R for a certain microdata set S
with respect to a certain researcher R and a certain key K,
is defined to be the probability that the researcher makes 
at least one disclosure of a record in S on the basis of K. 
In order to apply a criterion based on the disclosure risk, 
the value of this quantity for a given data set has to be 
determined. An expression for this quantity can be derived 
on the basis of a set of assumptions. 

In the model of Mokken et al. the following assumptions
are made in addition to C1 to C6:


A1. The circle of acquaintances A can be considered as a
random sample from the population.


A2. The data set S is a random sample from the population.
Assumption A1 serves to imply that the probability
that a randomly chosen element from the population is an
acquaintance of R is f_a = |A|/N, where N is the size of
the population. As a consequence the expected number of
unique elements in A, |U_a|, is equal to f_a |U| =
|A| f_u, where U is the set of unique persons in the
population and |U| its size. Obviously assumption A2
implies that the probability that a specific unique element
E is selected in the sample is f. These assumptions allow
one to obtain a very simple expression for the disclosure
risk D_R in terms of f, f_a and f_u, namely


$$ D_R = 1 - \exp(-N f f_a f_u). \qquad (1) $$
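
The expression for the disclosure risk above is easy to evaluate numerically. The following is a minimal sketch, not part of the original model description: the function name `disclosure_risk` and all input values are hypothetical and chosen only to illustrate how the three probabilities combine.

```python
import math


def disclosure_risk(N, f, f_a, f_u):
    """Probability of at least one re-identification: 1 - exp(-N * f * f_a * f_u)."""
    for name, p in (("f", f), ("f_a", f_a), ("f_u", f_u)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability, got {p}")
    return 1.0 - math.exp(-N * f * f_a * f_u)


# Hypothetical inputs, chosen only for illustration.
N = 15_000_000          # population size
f = 0.01                # sampling fraction
f_a = 100 / N           # researcher knows the key values of 100 persons
f_u = 0.005             # fraction of population uniques on the key

risk = disclosure_risk(N, f, f_a, f_u)
risk_half_sample = disclosure_risk(N, f / 2, f_a, f_u)
print(f"D_R = {risk:.6f}, with half the sample: {risk_half_sample:.6f}")
```

The exponent equals the sampling fraction times the expected number of unique acquaintances, so for small values the risk is roughly proportional to each of the three probabilities.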


Two of the parameters in the model of Mokken et al.
(1989, 1992), f_a and f_u, are unknown. The parameter f_a
can be ‘guestimated’, i.e., obtained by inspired guesswork,
by assuming different scenarios an attacker may follow.
A number of such scenarios has been described in Paass
and Wauschkuhn (1985) and Paass (1988). Evaluating f_a
seems difficult, however. In order to estimate the other
parameter, f_u, a number of models has been proposed in
the literature. Models to estimate the number of uniques
in the population, and hence f_u, that have been proposed
include the Poisson-gamma model (Bethlehem, Keller and 
Pannekoek 1989; Mokken et al. 1989; Willenborg, Mokken 
and Pannekoek 1990; De Jonge 1990), the negative binomial 
superpopulation model (Skinner, Marsh, Openshaw and 
Wymer 1990), the Poisson-lognormal model (Skinner and 
Holmes 1992; Hoogland 1994), models based on equivalence 
classes (Greenberg and Zayatz 1992) and models based on 
modified negative binomial-gamma functions (Crescenzi 
1992; Coccia 1992). As we have remarked in Section 4 not 
only the number of population uniques is important, but 
the numbers of cells with two, three, etc. persons are
important as well. The Poisson-gamma model, the Poisson-
lognormal model and the negative binomial superpopula-
tion model can be applied to estimate the number of cells
with two, three, etc. persons as well. It seems that the other
models mentioned above can be extended in order to 
estimate these numbers. A major drawback is that the 
results are not very reliable in many cases. 

From the model by Mokken et al. (1989, 1992) it is clear
that the statistical office that disseminates the data is able 
to influence the risk of re-identification. The statistical 
office basically has two ways to do this. First of all, the 
size of the data set can be reduced, i.e., the sampling 
fraction f can be reduced. A reduction of f implies a 
reduction of the risk. However, lowering f is generally 
undesirable, because usually f has to be reduced substan-
tially to be effective. This implies that only a small part
of the data available can be released. The second way in 
which the statistical office can influence the re-identification 
risk is by reducing the number of population uniques, i.e.,
by reducing f_u. The fraction f_u depends on the information
provided by the key variables. The less information the key
variables provide, the fewer uniques there are in the popula-
tion. In other words, f_u can be reduced by collapsing
categories (global recoding) and by replacing values by
missings (local suppression). Collapsing categories is a
global action, because it generally affects many records;
replacing values by missings is a local action because it
affects only a few individual records. Usually, the loss in
information when reducing f_u is considerably less than
the loss in information when reducing f. Therefore, a
statistical office will usually choose to control the re-
identification risk by reducing f_u rather than reducing f.
The third possibility of controlling the re-identification
risk, i.e., by reducing f_a, is not applied in practice, because
f_a is difficult to model.
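
The effect of global recoding on the fraction of uniques can be illustrated with a toy example. This is a sketch under stated assumptions: the six records, the variable names and the helper functions `fraction_unique` and `recode_age` are all hypothetical, and the small list stands in for a population that in practice is not observed.

```python
from collections import Counter


def fraction_unique(records, key):
    """Fraction of records whose combination of key values occurs exactly once."""
    counts = Counter(tuple(r[v] for v in key) for r in records)
    uniques = sum(1 for c in counts.values() if c == 1)
    return uniques / len(records)


def recode_age(record, width=10):
    """Global recoding: collapse exact age into bands of `width` years."""
    recoded = dict(record)
    lower = (record["age"] // width) * width
    recoded["age"] = f"{lower}-{lower + width - 1}"
    return recoded


# Hypothetical toy population.
records = [
    {"age": 31, "sex": "F", "occupation": "teacher"},
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 38, "sex": "F", "occupation": "teacher"},
    {"age": 52, "sex": "M", "occupation": "baker"},
    {"age": 57, "sex": "M", "occupation": "baker"},
    {"age": 45, "sex": "M", "occupation": "mayor"},
]
key = ("age", "sex", "occupation")

before = fraction_unique(records, key)
after = fraction_unique([recode_age(r) for r in records], key)
print(f"fraction unique before recoding: {before:.3f}, after: {after:.3f}")
```

With exact ages every record is unique on the key; after collapsing age into ten-year bands only one record remains unique. That remaining record is the kind of case for which local suppression would be considered.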

Although the model by Mokken et al. (1989, 1992)
provides some insight into how to reduce the disclosure risk,
it can hardly be used as a basis for the protection of micro-
data sets. The reason for this is that the two parameters
of the model, f_u and f_a, are often difficult to evaluate.
Usually there is insufficient data available to estimate f_a
and f_u accurately. We conclude that even a model for a
re-identification risk for an entire microdata set is difficult 
to apply in practice. In Section 7 we therefore face reality 
in which we have no satisfactory model for either the re- 
identification risk per record or re-identification risk for 
an entire microdata set. 


7. INTUITIVE RE-IDENTIFICATION RISK 


In reality we are, unfortunately, forced to base SDC on 
heuristic arguments rather than on a solid theoretical basis. 
The SDC rules mentioned in this section all reduce the re- 
identification risk. It is, however, not possible to evaluate 
this reduction of the re-identification risk. At Statistics 
Netherlands, rules for SDC of microdata are based on 
testing whether scores on certain keys occur frequently 
enough in the population. A few problems arise here:
determining the keys that have to be examined, estimating
the number of persons in the population that score on a
certain key, making the phrase ‘frequently enough’
operational, e.g., by determining one or more threshold
values, and determining appropriate SDC-measures.

Statistics Netherlands distinguishes between two kinds 
of microdata sets. The first kind is a so-called public use 
file. A public use file can be obtained by everybody. The 
keys that have to be examined for a public use file are all 
combinations of two identifying variables. The number of 
identifying variables is limited, and certain identifying 
variables, such as place of residence, are not included in
a public use file. Moreover, sampling weights have to be 
examined before they can be included in a public use file, 
because there are many situations in which weights can give 
additional information (cf. De Waal and Willenborg 1995a). 
For instance, when a certain subpopulation is oversampled
then this subpopulation can be recognized by the low
weights associated with its members in the sample. Weights
may only be published when they do not provide additional
information that can be used for disclosure purposes.
In case sampling weights are not considered suited for
publication, SDC measures should be taken, such as sub-
sampling the units with a low weight in order to get a sub-
sample in which all units have approximately the same
weight. Because the weights are approximately equal,
assuming that they are exactly equal would introduce only
a small error. The second kind of microdata set is a so- 
called microdata set for research. A microdata set for 
research can only be obtained by well-respected (statistical) 
research offices. The information content of a microdata 
set for research is much higher than that of a public use 
file. The number of identifying variables is not limited and 
an identifying variable such as place of residence may be 
included in a microdata set for research. Because of the 
high information content of a microdata set for research, 
researchers have to sign a declaration stating that they will 
protect any information about an individual respondent 
that might be disclosed by them. The keys that have to be 
examined for a microdata set for research consist of three- 
way combinations of variables describing a region with 
variables describing the sex, ethnic group or nationality 
of a respondent with an ordinary identifying variable. 

The rules Statistics Netherlands applies for SDC are 
based on the following idea: a key value, i.e., a combina- 
tion of scores on the identifying variables that together 
constitute the key, is considered safe for release if the 
frequency with which this key value occurs in the population
is more than a certain threshold value d_0. This value d_0 was
chosen after a careful and extensive search considering
many different values and comparing the records which
have to be modified for each value of d_0. The value that
leads to the ‘most likely’ set of records which have to be
modified has been chosen to be the value of d_0. Which
records are considered to be the ‘most likely’ ones to be 
modified is a matter of personal judgment. 

When applying one of the above rules we are generally
faced with the problem that we do not know the number
of times that a key value occurs in the population. We only 
have the sample available to us. The population frequency 
of a key value has to be estimated based upon the sample. 
For large regions it is possible to use an interval estimator 
to test whether or not a key value occurs often enough in 
a region. This interval estimator is based on the assump- 
tion that the number of times that a key value occurs in 
the population is Poisson distributed (cf. Pannekoek 
1995). However, for relatively small regions the number 
of respondents is low, which causes the estimator to have 
a high variance which in turn causes a lot of records to 
be modified. To estimate the number of times that a key
value occurs in a small region we therefore suggest applying
a point estimator. We will now discuss some possibilities
for such an estimator.

A simple point estimator for the number of times that 
a certain key value occurs in a region is the direct point 
estimator. The fraction of a key value in a region i is
estimated by the sample frequency of this key value in
region i divided by the number of respondents in region
i. The population frequency is then estimated by this
estimated fraction multiplied by the number of inhabitants
in region i. When the number of respondents in region i
is low, which is often the case, the direct estimator is un-
reliable. Another point estimator is based on the assump-
tion that the persons who score on a certain key value are
distributed homogeneously over the population. In this
case the fraction of a key value in region i can be estimated
by the fraction in the entire sample. The advantage of this
so-called synthetic estimator is that the variance is much
smaller than the variance of the direct estimator. Unfor-
tunately, the homogeneity assumption is usually not
satisfied, which causes the estimator to be biased. However,
a combined estimator can be constructed with both an 
acceptable variance and an acceptable bias by using a 
convex combination of the direct estimator and the syn- 
thetic estimator. Such a combined estimator has been 
tested in Pannekoek and de Waal (1995). 
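
The three point estimators described above can be sketched as follows. This is an illustration only: the function names, the counts and the mixing weight `w` are hypothetical, and the text does not say how the weight of the convex combination is chosen in Pannekoek and de Waal (1995).

```python
def direct_estimate(k_i, n_i, N_i):
    """Sample fraction of the key value in region i, scaled to the region's population."""
    return N_i * k_i / n_i


def synthetic_estimate(k_total, n_total, N_i):
    """Fraction in the entire sample, scaled to the population of region i."""
    return N_i * k_total / n_total


def combined_estimate(k_i, n_i, k_total, n_total, N_i, w):
    """Convex combination of the direct and the synthetic estimator, 0 <= w <= 1."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0 and 1")
    direct = direct_estimate(k_i, n_i, N_i)
    synthetic = synthetic_estimate(k_total, n_total, N_i)
    return w * direct + (1.0 - w) * synthetic


# Hypothetical counts: a small region with 40 respondents, one of whom has the key value.
k_i, n_i, N_i = 1, 40, 20_000        # region i: sample count, respondents, inhabitants
k_total, n_total = 100, 10_000       # entire sample: count with key value, respondents

direct = direct_estimate(k_i, n_i, N_i)
synthetic = synthetic_estimate(k_total, n_total, N_i)
combined = combined_estimate(k_i, n_i, k_total, n_total, N_i, w=0.25)
print(direct, synthetic, combined)
```

In this example the direct estimate rests on a single respondent and would change sharply if one more or one fewer respondent had the key value, while the synthetic estimate ignores the region entirely; the combined estimate lies between the two. The estimated frequency would then be compared with the threshold value.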

Another practical problem that deserves attention is 
top-coding of extreme values of continuous (sensitive) 
variables. These extreme values may lead to re-identification 
because these values are rare in the population. At the 
moment Statistics Netherlands uses an interval estimator 
to test whether there is a sufficient number of individuals 
in the population who score on a ‘comparable’ value of 
the continuous variable (cf. Pannekoek 1992). If this is the 
case, then the extreme value may be published, otherwise 
the extreme value must be suppressed. In order to apply 
this method in practice it remains to specify what is meant 
by ‘sufficient’ and by ‘comparable’. 

Some important practical problems occur when deter- 
mining which protection measures should be taken when a 
microdata set appears to be unsafe. In that case the original 
data set must be modified in such a way that the informa- 
tion loss due to SDC-measures is as low as possible while 
the resultant data set is considered safe. In De Waal and 
Willenborg (1994a) and De Waal and Willenborg (1995b) 
a model for determining the optimal local suppressions 
is presented. Determining the optimal global recodings 
is much more difficult. Comparing the information loss 
due to global recodings to the information loss due to local
suppressions is already a problem. In De Waal and
Willenborg (1995c) this latter problem is solved by using 
the entropy. 

Currently a general purpose software package for SDC 
of microdata is being developed at Statistics Netherlands 
(cf. De Jong 1992; De Waal and Willenborg 1994b; Van 
Gelderen 1995; Pieters and De Waal 1995; De Waal and 
Pieters 1995). The package, ARGUS, should enable the
statistical office to analyze the data and to carry out
suitable protection measures. It will consist of two separate
parts: μ-ARGUS for SDC of microdata and τ-ARGUS for
SDC of tabular data. The structure of the package is such
that it will be possible to specify different disclosure control 
rules. This implies that ARGUS will be suited for other 
statistical offices too. Moreover, it will be possible to 
incorporate changes in the rules fairly easily in the package. 


8. CONCLUSIONS 


There is one important conclusion one can draw from 
this paper: SDC still offers a lot of possibilities for future 
research, despite the considerable amount of research that 
has been carried out to date. The theory of SDC for 
microdata has a number of gaps. Among the technical 
problems that remain to be solved are the following. When 
we want to release data for small regions we need an accep- 
table estimator for the number of times that a key value 
occurs in these regions. Such an estimator is difficult to 
construct, although the preliminary results obtained at 
Statistics Netherlands seem encouraging. An important 
practical problem is the determination of appropriate 
global recodings and local suppressions. Yet another one 
is the determination of the number of uniques, or more 
generally the number of rare frequencies, in the population. 
Some of the models proposed in Section 6 appear to be 
acceptable, but can probably be improved upon. An alter- 
native approach is to determine which elements in the 
sample are unique in the population. In Verboon (1994), 
and Verboon and Willenborg (1995) this approach is 
examined. An extension of the model by Mokken et al.
(1989, 1992) to estimate the risk of re-identification of a 
file is yet another problem to be solved. This extension 
should take into account that measurement errors have 
been made and that population uniqueness is not necessary 
in order to disclose information. Finally, a model to 
estimate the re-identification risk per record would be very 
welcome. In fact, it would yield a sound criterion to judge 
the safety of a microdata set. This criterion can guide one 
in producing safe microdata sets by applying SDC-measures 
such as global recoding and local suppression. 

Apart from technical problems there are also some 
policy problems. Based on the policy that a statistical
office wants to pursue, the following decisions should be
made. The combinations of variables that should be 
examined should be specified. Suitable threshold values 
should be selected. 

More and better software must be developed in order 
to deal with time-consuming calculations. For microdata, 
software must be developed to indicate which records and 
variables must be modified, and how they should be 
modified, when applying a particular disclosure rule. At
the time of writing, an international project on SDC is
about to start. The participating institutions in this project 
are the Eindhoven University of Technology, the University 
of Manchester, the University of Leeds, the Office of 
Population Censuses and Surveys (OPCS), the Istituto 
Nazionale di Statistica (ISTAT), the Consorzio Padova
Ricerche (CPR), and Statistics Netherlands. One of the
major aims of the project is to develop software for the
SDC of both microdata (μ-ARGUS) and tabular data
(τ-ARGUS).

Finally, some very practical problems remain to be 
solved. An example of such a problem is the determination 
of a set of rules for selecting identifying variables. Such 
a set of rules would be a very valuable asset. Without these 
rules identifying variables are selected by making subjec- 
tive choices. Developing such a set of rules is another goal 
of the above mentioned SDC-project. 
GUIDELINES FOR MANUSCRIPTS


Before having a manuscript typed for submission, please examine a recent issue (Vol. 19, No. 1 and onward) of 
Survey Methodology as a guide and note particularly the following points: 


Layout 


Manuscripts should be typed on white bond paper of standard size (8½ × 11 inch), one side only, entirely double
spaced with margins of at least 1½ inches on all sides.

The manuscripts should be divided into numbered sections with suitable verbal titles. 

The name and address of each author should be given as a footnote on the first page of the manuscript. 
Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. 
Avoid mathematical expressions in the abstract. 


Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise except for functional symbols such as
“exp(·)” and “log(·)”, etc.

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important 
equations should be separated from the text and numbered consecutively with arabic numerals on the right if 
they are to be referred to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters (e.g., w, ω; o, O, 0).

Italics are used for emphasis. Indicate italics by underlining on the manuscript. 


Figures and Tables 


All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly 
self explanatory as possible, at the bottom for figures and at the top for tables. 

They should be put on separate pages with an indication of their appropriate placement in the text. (Normally 
they should appear near where they are first referred to). 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference 
is cited, indicate after the reference, e.g., Cochran (1977, p. 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year 
of publication. Journal titles should not be abbreviated. Follow the same format used in recent issues. 
In This Issue


This issue of Survey Methodology begins with a special section entitled Weighting and Estimation 
which contains four papers. 

The first paper in this special section, by Singh and Mohl, gives an overview of calibration methods 
from a different perspective, with the objective of gaining a better heuristic understanding of these 
methods. Deville and Särndal presented calibration methods as minimizing the overall distance of the
final weights from the survey weights, subject to the restriction that estimates of totals of certain 
covariates match known population totals. Singh and Mohl present different calibration methods as being 
derived from different models for the weight adjustment factors. Computational algorithms for different 
methods are provided in an appendix, and a numerical example is given to illustrate how the resulting 
weight adjustment factors might vary among the different methods. 

Stukel, Hidiroglou and Särndal also investigate calibration estimators, the class of design-based point
estimators developed by Deville and Särndal. These estimators are derived from distance functions and
allow for restricting of the final weights such that they are positive or upwardly bounded, thus avoiding 
the usual problem of negative weights that arises from using the regression estimator. Through 
simulation, the properties of a number of these estimators based on different distance functions are 
studied; particular emphasis is given to the properties of the corresponding variance estimators, 
specifically the Jackknife and the Taylor. The surprising conclusion is that the bias of both the point 
estimators and the corresponding variance estimators is minimal, even under severe restricting of the final 
weights. 

Jayasuriya and Valliant compare three methods of deriving household weights for the Consumer 
Expenditure Survey of the U.S. Bureau of Labor Statistics. Survey weights are usually calibrated to 
population totals of individual level characteristics, resulting in different final weights for individuals in 
the same household. The principal person method defines the final weight for the household to be the 
same as that for a particular person in the household. The regression approach replaces the vector of 
auxiliary variables for each individual in a household by the household average, resulting in identical 
calibrated weights for persons in the same household. Another option is obtained by restricting the weight 
adjustment factors to avoid extreme or negative weights. Variations on these methods are compared with 
respect to the final weights and the estimated CVs for a variety of household expenditure categories. 

In the final paper in the section on Weighting and Estimation, Chen and Chen consider the problem 
of confidence interval estimation for a finite population average when auxiliary information is available. 
Noting the earlier results of Royall and Cumberland that show that naive use of existing design-based 
methods results in confidence intervals with very poor conditional coverage probabilities, they suggest 
transformations of the data which improve the adherence to the underlying normality assumption and thus 
improve the coverage rates. Auxiliary information is incorporated in two ways: either directly into the 
inference when auxiliary information is known for each unit or through calibration with empirical 
likelihood when auxiliary information is known only at the population level. Through simulation applied 
to six real populations, they show that their methods perform well. 

In their paper, Thompson and Fisher modify the one and two sample McNemar tests for use with 
complex survey data. They then apply the modified two sample test to data from the U.S. Bureau of the 
Census Current Population Survey’s Split Panel Study to test whether or not the shift to computer 
assisted telephone interviewing using a redesigned questionnaire would affect the estimates of 
unemployment. Results of this test are discussed and compared to other research on the effect of CATI 
on unemployment estimates. 

Eltinge and Jang suggest ways for evaluating the stability of estimates of variance components 
(specifically within-PSU variance estimators) and other related quantities, under a complex three-stage 
design. As measures, they consider a simple design-based variance estimator of the within-PSU variance 
estimator, as well as an estimated “degrees of freedom” approach. A simulation based method permits 
the assessment as to whether an observed stability measure is consistent with standard assumptions 
regarding variance estimator stability. They apply the proposed methods to NHANES III data and show 
that true stability properties may vary substantially across variables, and that within-PSU variance 
estimators can be substantially less stable than one would anticipate from using a simple count of 
secondary units within each stratum. 
Berger discusses Chao’s plan for sequentially selecting an unequal-probability sample of fixed size
without replacement. In this context, he suggests an approximation of the second-order probabilities of 
inclusion in order to obtain an approximate estimator of the variance for the Horvitz and Thompson 
estimator. This variance is then compared to approximations given for other procedures or selection
plans. Equivalence conditions for these approximations are presented. 

Cowling, Chambers, Lindsay and Parameswaran present two techniques for producing spatially 
smoothed data and consider their implications in both small and large area estimation. For the small area 
application, the sample weights are spatially smoothed using a modified linear regression approach, 
which results in a decrease in the variance but an increase in the bias of the estimates. For the large area 
application, a nonparametric regression method is used to spatially smooth the data and then the 
smoothed data is mapped using a Geographic Information System package. The results of a simulation 
study are presented, in which the most appropriate method and level of smoothing for use in the maps 
is investigated. 

Brick, Waksberg and Keeter suggest using information on interruptions of telephone service so as to 
adjust the survey estimates to compensate for undercoverage bias. The data collected on telephone service 
interruptions serve to reduce the bias, but at the same time the variance is likely to increase owing to the 
greater variability of the sampling weights. The results obtained from a national survey show a significant 
potential for reducing the mean square error of the estimates under certain conditions. 

Finally, Pandher uses a model based approach to find an optimal partition of a survey population into 
take-all and take-some strata. The approach assumes that there is a single variable of interest and that 
probability proportional to size sampling is used in the take-some stratum. An algorithm is presented for 
determining the optimal cut point between the take-all and take-some groups. A key requirement for the 
algorithm is that the model expectation of the variance is a convex function of the number of units in the 
take-all stratum, which depends on the model assumptions and the form of the inclusion probabilities. 
The method is applied to Statistics Canada’s Local Government Finance Survey. 
Understanding Calibration Estimators in Survey Sampling


A.C. SINGH and C.A. MOHL


ABSTRACT 


There exist well known methods due to Deville and Särndal (1992) which adjust sampling weights to meet benchmark
constraints and range restrictions. The resulting estimators are known as calibration estimators. There also exists an earlier,
but perhaps not as well known, method due to Huang and Fuller (1978). In addition, alternative methods were developed
by Singh (1993), who showed that similar to the result of Deville-Särndal, all these methods are asymptotically equivalent
to the regression method. The purpose of this paper is threefold: (i) to attempt to provide a simple heuristic justification of
all calibration estimators (including both well known and not so well known) by taking a non-traditional approach; to do
this, a model (instead of the distance function) for the weight adjustment factor is first chosen and then a suitable method
of model fitting is shown to correspond to the distance minimization solution, (ii) to provide to practitioners computational
algorithms as a quick reference, and (iii) to illustrate how various methods might compare in terms of distribution of weight
adjustment factors, point estimates, estimated precision, and computational burden by giving numerical examples based
on a real data set. Some interesting observations can be made by means of a descriptive analysis of numerical results which
indicate that while all the calibration methods seem to behave similarly to the regression method for loose bounds, they
however seem to behave differently for tight bounds.


KEY WORDS: Benchmark constraints; Distance minimization; Non-negative weights; Range restrictions. 


1. INTRODUCTION 


In providing estimates from sample surveys, sampling 
weights are commonly adjusted to obtain calibrated weights 
in order to match totals or benchmark constraints (BCs) for 
auxiliary variables. The methods of regression and raking are 
often used for this purpose. Although these methods have 
good asymptotic properties (see Deville and Särndal 1992),
they may lead to calibrated weights with undesirable (finite 
sample) properties. The regression method can give negative 
weights while the raking procedure can produce very high 
weights. For this reason, range restrictions (RRs) may be 
imposed on the calibrated weights. It would be desirable to 
have a calibration method which (i) produces calibrated 
weights close to the original sampling weights; this can be 
achieved via minimization of a suitable distance function 
between the two sets of weights, (ii) meets BCs, and (iii) 
satisfies RRs. There exist several methods in the literature for 
weight adjustment under BCs and RRs, see e.g., Deville and 
Sarndal (1992, henceforth referred to as DS) for recent 
developments, and Huang and Fuller (1978) for earlier 
developments. For a review, as well as some further work, see 
Singh (1993, henceforth referred to as Singh). These methods 
are iterative in nature and can be classified into two families. 
Family I consists of methods which satisfy BCs after each 
iteration and continue to iterate until RRs are met. Family II, 
on the other hand, consists of methods which satisfy RRs 
after each iteration and continue to iterate until BCs are met. 


Methods of DS belong to family II while that of Huang-Fuller
belongs to family I. Two additional methods, one for each
family, were proposed by Singh. Using arguments similar to 
DS, Singh extended the remarkable result of DS by showing 
that all of the methods in families I and II are asymptotically 
equivalent to the regression method. 

In Section 2, a non-traditional approach is followed in 
introducing each method which is expected to help in under- 
standing of calibration estimators. The functional form of the 
weight adjustment factor is first heuristically motivated and 
later on a connection between a suitable method of model 
fitting and minimization of the distance function is made. 
Alongside, computational algorithms are given as a quick 
reference for practitioners. A computer program in GAUSS 
software is available from the second author; see also Singh 
and Mohl (1997). In Section 3, numerical examples are pre- 
sented to illustrate various methods using data from Statistics 
Canada's Family Expenditure (FAMEX) survey. It is of prac- 
tical interest to see how different calibration methods might 
compare for a real data set. In particular, we examine by means 
of a descriptive analysis the impact of RRs on the computa- 
tional burden, distribution of weight adjustment factors, point 
estimates and their variance. Related comparative studies on 
calibration methods based on real data sets are due to Deville, 
Sarndal and Sautory (1993) and Stukel and Boyer (1993). 
These studies, however, are restricted to family II methods 
and are primarily concerned with the distribution of weight 
adjustment factors. Finally, Section 4 contains a discussion. 


1 A.C. Singh, Methodology Research Advisory Group, and C.A. Mohl, Health Statistics Methods Section, Household Survey Methods Division, Statistics Canada, Ottawa, K1A 0T6.


2. HEURISTIC JUSTIFICATION OF
CALIBRATION ESTIMATORS


We will use the following notation. Let n, N denote respectively the sample size and the population size. Let $h_k$ denote
the initial or h-weight (used in the expansion or Horvitz-Thompson estimator $\sum_{k=1}^{n} y_k h_k$) for the k-th element where $y_k$
is the value of the study variable. It is assumed that the h-weights include adjustments for any non-response. The
parameter of interest is the population total for y, denoted by $\tau_y$. For each k, there are p auxiliary variables,
$x_{kj}$, j = 1, ..., p, for which the population total or benchmark constraint $\tau_{x_j} = \sum_{k=1}^{N} x_{kj}$ for each j
is assumed to be known. The transposed p-vector $x_k'$ denotes $(x_{k1}, ..., x_{kp})$, the k-th row of the
n × p matrix X. Let $c_k^{(v)}$ denote the calibrated or c-weight for the k-th element at the v-th iteration.
At v = 0, $c_k^{(0)} = h_k$. The expansion estimators of population totals for variables y and
x, using c-weights at the v-th iteration are denoted by $t_y^{(v)}$ and $t_x^{(v)}$
respectively.

The RRs are specified by the condition $L \le g_k \le U$ where
$g_k = c_k/h_k$ and $L < 1 < U$, where L and U denote suitable lower
and upper bounds. The adjustment factors (i.e., $g_k$'s) are also
called g-weights. First we consider the unrestricted case (i.e.,
calibration without RRs) and then the restricted case. All
methods in the restricted case require iterations for finding a
solution. It is assumed that the iterative process converges in
a finite number of iterations.

The criterion for convergence is defined as follows. For
the iterative process to meet RRs, a tolerance level $\epsilon$ (e.g.,
.005 or .01) for family I is defined so that the process terminates
if the maximum absolute relative error (ARE) for
RRs is $\le \epsilon$. Similarly, a tolerance level ($\delta > 0$) for family II is
defined for meeting BCs by iterations. The reason for this is
that our primary goal is not minimization of the distance
function, but to find a solution which satisfies BCs and RRs.
In addition to $\epsilon$ and $\delta$, a parameter $v_{\max}$ is defined which limits
the number of iterations.

There are seven methods considered in this paper, two for 
the unrestricted case, two for restricted case in family I and 
the remaining three also for the restricted case but in family 
II. We have given alternative names to existing methods to 
facilitate understanding of the relationship between different 
methods. The naming convention is based on the well known 
distance measures used in the analysis of count data. 

Note that since all the methods are asymptotically equivalent
to the regression method, the asymptotic variance of $t_y$
can be estimated for each method by
$\sum_k \sum_l (\pi_{kl} - \pi_k \pi_l)\,\pi_{kl}^{-1}\,(g_k e_k/\pi_k)(g_l e_l/\pi_l)$,
as in DS (equation 3.4) where $\pi_k$, $\pi_{kl}$ are respectively
the first and second order inclusion probabilities,
$e_k$ are the sample residuals $y_k - B'x_k$ with $B' = (y'\Gamma_0 X)(X'\Gamma_0 X)^{-1}$,
and $\Gamma_0$ is the n × n matrix diag(h).


2.1 METHOD 1 (Linear Regression or Unrestricted 
Modified Chi Square, MCS-u) 


This method is the simplest and gives rise to the popular
generalized regression estimator of Sarndal (1980). Here, the
model for the adjustment factor is taken to be linear in x, i.e.,
$g_k = 1 + x_k'\lambda$, for some p-vector of model parameters $\lambda$ which
satisfies BCs. That is, $\sum_{k=1}^{n} h_k (1 + x_k'\lambda) x_{kj} = \tau_{x_j}$ for all j.
This gives rise to $\lambda^{\text{MCS-u}} = (X'\Gamma_0 X)^{-1}(\tau_x - t_x^{(0)})$. The
c-weights remain close to the h-weights in the sense that the
above choice of g-weights minimizes the distance function,
$\Delta^{\text{MCS-u}}(c,h) = \sum_{k=1}^{n} (c_k - h_k)^2/h_k$ subject to BCs. Note that the
g-weights could be negative for some k. This is rather
undesirable in practice although the simplicity of the method
is quite attractive. The computational algorithm for MCS-u is
given in Appendix A1.
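
The closed-form solution of Method 1 can be sketched in a few lines. The block below is only an illustration in Python with NumPy (the paper's own programs were written in GAUSS); the function name `calibrate_linear` and the toy data are assumptions made for this example and do not come from the FAMEX survey.

```python
import numpy as np

def calibrate_linear(X, h, tau_x):
    """MCS-u sketch: g_k = 1 + x_k' lam, lam = (X' Gamma_0 X)^{-1} (tau_x - t_x^(0))."""
    t0 = X.T @ h                                 # expansion estimates of the x-totals
    gram = X.T @ (h[:, None] * X)                # X' Gamma_0 X with Gamma_0 = diag(h)
    lam = np.linalg.solve(gram, tau_x - t0)
    g = 1.0 + X @ lam                            # g-weights; nothing keeps them positive
    return g, h * g                              # (g-weights, c-weights)

# Made-up toy data: an intercept column and one auxiliary variable.
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 3.0], [1.0, 8.0], [1.0, 4.0]])
h = np.array([10.0, 12.0, 8.0, 15.0, 9.0])
tau_x = np.array([60.0, 280.0])                  # assumed known benchmark totals

g, c = calibrate_linear(X, h, tau_x)
print(np.round(g, 4))                            # adjustment factors
print(X.T @ c)                                   # matches tau_x up to rounding
```

With these toy values all g-weights happen to stay positive; with benchmarks far from the expansion estimates the same code can return negative values, which is the drawback noted above.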


2.2 METHOD 2 (Raking or Unrestricted Modified 
Discrimination Information, MDI-u) 


This method is also commonly used. Here, the model for
the adjustment factor $g_k$ is taken as $\exp(x_k'\lambda)$, thus making it
necessarily non-negative. Unlike the case of method 1, the
model parameter vector $\lambda^{\text{MDI-u}}$ is obtained iteratively to
meet BCs. The iterations can be started with $\lambda^{\text{MCS-u}}$ from
the GR-estimator, i.e., for iteration 1, set $\lambda^{(1)} = \lambda^{\text{MCS-u}}$ which
implies $c_k^{(1)} = h_k \exp(x_k'\lambda^{(1)})$. These c-weights, in general,
do not satisfy BCs. For iteration 2 of this method, the $\lambda^{(1)}$
is adjusted (by a term of smaller order) to define $\lambda^{(2)}$
as $\lambda^{(1)} + (X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(1)})$, where $\Gamma_1 = \mathrm{diag}(c^{(1)})$. The $\lambda$
term is defined similarly for further iterations until convergence,
i.e., until BCs are met. The c-weights remain close to
h-weights because iterations used in the above method
constitute the Newton-Raphson steps for minimizing the distance
function, $\Delta^{\text{MDI-u}}(c,h) = \sum_{k=1}^{n} [c_k \log(c_k/h_k) - c_k + h_k]$
subject to BCs. Note that although the g-weights are non-negative,
they could be very high which is clearly not
desirable in practice. The computational algorithm for MDI-u
is given in Appendix A2.
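
The Newton-Raphson iteration described for raking can be sketched as follows. This is a hedged illustration in Python with NumPy, not the authors' GAUSS program; the function name, the relative-error stopping rule for BCs and the toy data are assumptions made for the example.

```python
import numpy as np

def calibrate_raking(X, h, tau_x, delta=1e-8, v_max=50):
    """MDI-u sketch: g_k = exp(x_k' lam), lam updated by Newton-Raphson steps."""
    # Iteration 1 starts from the regression solution lam^(1) = lam^(MCS-u).
    lam = np.linalg.solve(X.T @ (h[:, None] * X), tau_x - X.T @ h)
    for v in range(1, v_max + 1):
        c = h * np.exp(X @ lam)                  # c-weights at iteration v
        resid = tau_x - X.T @ c                  # tau_x - t_x^(v)
        if np.max(np.abs(resid / tau_x)) <= delta:   # assumed stopping rule for BCs
            return c / h, c, v
        gram = X.T @ (c[:, None] * X)            # X' Gamma_v X with Gamma_v = diag(c^(v))
        lam = lam + np.linalg.solve(gram, resid)
    raise RuntimeError("BCs not met within v_max iterations")

# Made-up toy data: an intercept column and one auxiliary variable.
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 3.0], [1.0, 8.0], [1.0, 4.0]])
h = np.array([10.0, 12.0, 8.0, 15.0, 9.0])
tau_x = np.array([60.0, 280.0])

g, c, n_iter = calibrate_raking(X, h, tau_x)
print(n_iter, np.round(g, 4))
```

The g-weights returned here are positive by construction, but the code places no upper limit on them.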


2.3 METHOD 3 (Modified Huang-Fuller or Scaled 
Modified Chi Square, SMCS) 


This method belongs to family I of the restricted case and
is a slight modification of the method due to Huang and Fuller
as given in Singh; see also Fuller, Loughin, and Baker (1994).
As in regression, the model for the adjustment factor is taken
to be linear in x. To facilitate the satisfaction of RRs by these
adjustments, a scaling factor $q_k$ ($0 \le q_k \le 1$) is used for each
k so that the change in h-weights for those units whose $g_k$'s
tend to go outside the bounds [L, U] is reduced. Thus, the
g-weight is given by $g_k = 1 + q_k x_k'\lambda$ where the model parameters
q and $\lambda$ are chosen iteratively in the sense that $\lambda$ is
found for a given q and then q is found for a given $\lambda$. We start
with $q_k^{(0)} = 1$ for all k and set $\lambda^{(1)} = \lambda^{\text{MCS-u}}$ for iteration 1.
Now, clearly $c^{(1)}$ satisfies BCs but RRs need not be satisfied.
Depending on the location of $g_k^{(1)}$'s in relation to [L, U], a
working rule can be used to define $q_k^{(1)}$'s so that the $q_k$'s
discount more for those units which are farther outside of the
boundaries than those which are nearer. The scaling factors
$q_k^{(1)}$, so determined, define in turn $\lambda^{(2)}$ for iteration 2 as
$(X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(0)})$ where $\Gamma_1 = \mathrm{diag}(q_k^{*(1)} h_k)$, $q_k^{*(1)} = q_k^{(0)} q_k^{(1)}$,
with $\lambda^{(2)}$ satisfying BCs after the iteration. Note that under usual
regularity conditions, $\lambda^{(2)}$ differs from $\lambda^{(1)}$ only by a term of
smaller order, since the maximum absolute difference
$|g_k^{(1)} - 1|$ is small. Next, if $g^{(2)}$ after iteration 2 does not satisfy
RRs, the scaling factors $q_k^{(2)}$ are defined appropriately and
compounded with $q_k^{*(1)}$ to get $q_k^{*(2)}$ for use in iteration 3. The
$\lambda^{(3)}$ for iteration 3 is then obtained as before so that BCs
are satisfied after the iteration. Iterations continue until
convergence, i.e., until RRs are met. The weight vector $c^{\text{SMCS}}$
is close to h because at each iteration v, $c^{(v)}$ minimizes
the distance function $\Delta^{\text{SMCS}}(c,h) = \sum_{k=1}^{n} (c_k - h_k)^2/(h_k q_k^{*(v-1)})$
subject to BCs, where $q_k^{*(v-1)} = q_k^{(0)} q_k^{(1)} \cdots q_k^{(v-1)}$ for $v \ge 1$.
Note that unlike the previous methods, the distance function
varies from iteration to iteration.

The computational algorithm for SMCS is given in
Appendix A3. Note that in the algorithm, [L, U] is shrunk to
[L', U'] by means of a parameter $\alpha$ where $L' = \alpha L + 1 - \alpha$,
$U' = \alpha U + 1 - \alpha$, and $0 < \alpha \le 1$. This implies that some units
that are inside [L, U] but close to the boundary are also
discounted. This helps to speed up the convergence. Another
parameter $\beta$, $0 < \beta < 1$, is also introduced to allow differential
discounting of different units.


2.4 METHOD 4 (Shrinkage-Minimization, SM) 


This method also belongs to family I and is due to Singh.
As in regression, the model for the adjustment factor is taken
to be linear in x, but a new parameter termed the shrinkage
factor $\psi_k$ ($0 \le \psi_k \le 1$) is used for each k so that $g_k$'s meet RRs,
i.e., $g_k$ is set at $(1 + \psi_k x_k'\lambda(k))$. Notice that $\lambda$ is allowed to
depend on k through $\psi_k$ and $x_k$. Unlike SMCS, here the
g-weights, after discounting, satisfy RRs exactly, i.e., those
g-weights which are outside [L, U] are shrunk to lie on or
inside the boundary. Therefore, $\psi_k$'s can be defined quite
easily in practice. The model parameters $\psi$ and $\lambda$ are chosen
iteratively in a manner analogous to that for SMCS. We start
with $\psi_k^{(0)} = 1$ and set $\lambda^{(1)} = \lambda^{\text{MCS-u}}$ for iteration 1 to obtain
$g_k^{(1)}$ as $(1 + \psi_k^{(0)} x_k'\lambda^{(1)})$. Clearly BCs are satisfied after the
iteration but RRs need not be. Before iteration 2, $g_k^{(1)}$ is
shrunk by $\psi_k^{(1)}$ to obtain $g_k^{(1)*}$ as $(1 + \psi_k^{*(1)} x_k'\lambda^{(1)})$ where
$\psi_k^{*(1)} = \psi_k^{(0)}\psi_k^{(1)}$, which meets RRs. Given $\psi^{*(1)}$, $\lambda^{(2)}(k)$ is obtained
as $\lambda^{(1)} + (1/\psi_k^{*(1)})(X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(1)*}) + x_k'(X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(1)*})\,\lambda^{(1)}$
where $\Gamma_1 = \mathrm{diag}(c^{(1)*})$, $c_k^{(1)*} = h_k g_k^{(1)*}$, and
$t_x^{(1)*}$ is the expansion estimator using c*-weights. Again
BCs are satisfied after the iteration but RRs need not be. Note
that $\lambda^{(2)}(k)$ differs from $\lambda^{(1)}$ by a term of smaller order
uniformly over k. Iterations are continued until convergence,
i.e., until RRs are met. The weight vector $c^{\text{SM}}$ is close to h
because at each iteration v > 1, $c^{(v)}$ minimizes the distance
function, $\Delta^{\text{SM}}(c, c^{(v-1)*}) = \sum_{k=1}^{n} (c_k - c_k^{(v-1)*})^2/c_k^{(v-1)*}$ subject
to BCs. Note that in practice $c^{(v)*}$ can be obtained directly
from $c^{(v)}$ without having to calculate $\psi^{(v)}$ separately. As with
SMCS, the distance function depends on the iteration.

The computational algorithm is given in Appendix A4.
Recall that in the above method, if a g-weight falls outside of
the L and U boundaries, an adjustment is made to bring the
g-weight back to the L or U boundary. A new parameter
$\alpha$ ($0 < \alpha \le 1$) is introduced to allow the user to bring the
g-weight farther inside the boundary to a point L' or U'
($L' = \alpha L + 1 - \alpha$, $U' = \alpha U + 1 - \alpha$). This is somewhat
similar to the $\alpha$ parameter of SMCS. Another parameter
$\eta$ ($0 < \alpha \le \eta \le 1$) is introduced to adjust the g-weights to the
level L' or U' also for those units which are within [L, U], but
close to the boundary in that they are outside [L'', U''] where
$L'' = \eta L + 1 - \eta$, $U'' = \eta U + 1 - \eta$. All these parameters
help speed up the convergence in general.
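
A simplified sketch of the shrinkage-minimization idea may help fix the order of operations in a family I method: shrink the offending g-weights, then re-impose BCs by a chi-square type minimization around the shrunken weights. The Python/NumPy code below is an illustration only. It works with $c^{(v)*}$ directly, omits the $\eta$ parameter, and assumes a particular definition of the absolute relative error for RRs, namely max over k of (L - g_k)/L and (g_k - U)/U; none of these choices is taken from Appendix A4, and the data are made up.

```python
import numpy as np

def calibrate_sm(X, h, tau_x, L, U, alpha=1.0, eps=0.01, v_max=10):
    """Simplified SM sketch: BCs hold after every iteration, RRs up to tolerance eps."""
    L1, U1 = alpha * L + 1 - alpha, alpha * U + 1 - alpha     # shrunken bounds L', U'
    g = 1.0 + X @ np.linalg.solve(X.T @ (h[:, None] * X), tau_x - X.T @ h)  # iteration 1
    for v in range(1, v_max + 1):
        are = np.maximum((L - g) / L, (g - U) / U).max()      # assumed ARE definition
        if are <= eps:
            return g, h * g, v
        # Shrink g-weights outside [L, U] to L' or U'; the others are left alone.
        g_star = np.where(g > U, U1, np.where(g < L, L1, g))
        c_star = h * g_star
        # Minimize sum (c - c*)^2 / c* subject to BCs: c = c* (1 + x' mu).
        mu = np.linalg.solve(X.T @ (c_star[:, None] * X), tau_x - X.T @ c_star)
        g = g_star * (1.0 + X @ mu)
    raise RuntimeError("RRs not met within v_max iterations")

# Made-up toy data and bounds.
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 3.0], [1.0, 8.0], [1.0, 4.0]])
h = np.array([10.0, 12.0, 8.0, 15.0, 9.0])
tau_x = np.array([60.0, 280.0])
L, U = 0.8, 1.19

g, c, n_iter = calibrate_sm(X, h, tau_x, L, U)
print(n_iter, np.round(g, 4))
```

In this sketch the returned weights satisfy BCs exactly, while the largest g-weight may exceed U by up to the relative tolerance, which mirrors the family I convention of iterating until RRs are met.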


2.5 METHOD 5 (Linear Truncated or Restricted 
Modified Chi Square, MCS-r) 


This well known method belongs to family II of the
restricted case and is due to DS. As in SM, the model for the
adjustment factor is taken to be linear in x with a new
parameter termed the truncation factor $\phi_k$ ($0 \le \phi_k \le 1$) which
is used for each k so that $g_k$'s meet RRs, i.e., $g_k$ is set at
$(1 + \phi_k x_k'\lambda(k))$. The only difference between the truncation
factor $\phi_k$ used here and the shrinkage factor used in SM is
that here those g-weights which are outside [L, U] are
always adjusted to lie exactly on the boundary. The model
parameters are chosen iteratively. Initially we set $\phi_k^{(0)} = 1$ and
at iteration 1, $\lambda^{(1)} = \lambda^{\text{MCS-u}}$ to obtain $g_k^{(1)} = (1 + \phi_k^{(0)} x_k'\lambda^{(1)})$,
which is further adjusted (or truncated) to obtain $g_k^{(1)*} =
(1 + \phi_k^{*(1)} x_k'\lambda^{(1)})$ where $\phi_k^{*(1)} = \phi_k^{(0)}\phi_k^{(1)}$, so that RRs are met.
However, $g^{(1)*}$ may not satisfy BCs. Note that the difference
between $g^{(1)*}$ and $g^{\text{MCS-u}}$ is of smaller order. Now, for iteration 2,
$\lambda^{(1)}$ is adjusted by a term of smaller order (uniformly
over k) to define $\lambda^{(2)}(k)$ as $\lambda^{(1)} + (1/\phi_k^{*(1)})(X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(1)*})$
where $\Gamma_1 = \mathrm{diag}(h)$ except that the diagonal elements are
truncated to zero for all those k for which $\phi_k^{(1)} < 1$, i.e., those
units which were truncated at the previous iteration. This
discounting of diagonal elements is somewhat similar to using
a zero scaling factor in SMCS. In the second iteration, we
have $g_k^{(2)} = 1 + \phi_k^{*(1)} x_k'\lambda^{(2)}(k)$ and the truncation factors $\phi_k^{(2)}$
are used to obtain $g_k^{(2)*}$ which satisfy RRs. The successive
iterations are defined in a similar manner. Clearly, unlike SM,
here RRs are met at each iteration. Iterations are continued
until BCs are met. The weight vector $c^{\text{MCS-r}}$ is close to h
because the iterations defined above constitute the Newton-Raphson
steps for minimizing the distance function
$\Delta^{\text{MCS-r}}(c,h) = \sum_k (c_k - h_k)^2/h_k$ if $L h_k \le c_k \le U h_k$; $\infty$ otherwise,
subject to BCs. The computational algorithm is given in
Appendix A5. Note that, in practice, it is more convenient to
work with $g^{(v)*}$ directly without having to compute $\phi^{(v)}$
separately.
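
The truncation scheme can be sketched by working with $g^{(v)*}$ directly, as suggested above. The Python/NumPy code below is a minimal illustration and not the algorithm of Appendix A5 verbatim; the function name, the relative-error stopping rule for BCs and the toy data are assumptions.

```python
import numpy as np

def calibrate_linear_truncated(X, h, tau_x, L, U, delta=1e-8, v_max=20):
    """MCS-r sketch: RRs hold after every iteration, iterate until BCs are met."""
    g = 1.0 + X @ np.linalg.solve(X.T @ (h[:, None] * X), tau_x - X.T @ h)  # iteration 1
    for v in range(1, v_max + 1):
        g_star = np.clip(g, L, U)                # truncate to the boundary
        c = h * g_star
        resid = tau_x - X.T @ c                  # tau_x - t_x^(v)*
        if np.max(np.abs(resid / tau_x)) <= delta:
            return g_star, c, v
        # Gamma_v = diag(h) with zeros for the units truncated at this iteration.
        gamma = np.where(g_star != g, 0.0, h)
        g = g_star + X @ np.linalg.solve(X.T @ (gamma[:, None] * X), resid)
    raise RuntimeError("BCs not met within v_max iterations")

# Made-up toy data and bounds.
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 3.0], [1.0, 8.0], [1.0, 4.0]])
h = np.array([10.0, 12.0, 8.0, 15.0, 9.0])
tau_x = np.array([60.0, 280.0])
L, U = 0.8, 1.19

g, c, n_iter = calibrate_linear_truncated(X, h, tau_x, L, U)
print(n_iter, np.round(g, 4))
```

For these toy values one unit is held at the upper bound and the remaining units absorb the adjustment, so BCs and RRs hold together after the second pass.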


2.6 METHOD 6 (Restricted Modified Discrimination 
Information or MDI-r) 


This method also belongs to family II and was proposed by
Singh following the lines of DS in developing MCS-r. It is
related to MDI-u in the same way as MCS-r is to MCS-u. The
basic idea is to adjust parameters $\phi$ and $\lambda$ in the adjustment
factor $g_k = \phi_k \exp(x_k'\lambda)$ so that RRs and BCs are satisfied.
The truncation parameter $\phi$ is similar to that for MCS-r. This
is done iteratively. Similar to MCS-r, at iteration 1 we set
$g_k^{(1)} = \phi_k^{(0)}\exp(x_k'\lambda^{(1)})$ where $\phi_k^{(0)} = 1$, $\lambda^{(1)} = \lambda^{\text{MCS-u}}$, which
is further adjusted by a term of smaller order to obtain $g_k^{(1)*}$
as $\phi_k^{*(1)}\exp(x_k'\lambda^{(1)})$ so that RRs are met, i.e., it lies in [L, U].
Next for iteration 2, $\lambda^{(1)}$ is adjusted by a term of smaller
order to obtain $g_k^{(2)}$ as $\phi_k^{*(1)}\exp(x_k'\lambda^{(2)})$, where $\lambda^{(2)} =
\lambda^{(1)} + (X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(1)*})$, and $\Gamma_1 = \mathrm{diag}(c^{(1)*})$ except that
the diagonal elements are truncated to 0 for all those k for
which $\phi_k^{(1)} < 1$. The truncation factors $\phi_k^{(2)}$ are used to ensure
that RRs are met. Iterations are continued until convergence,
i.e., until BCs are met. The weight vector $c^{\text{MDI-r}}$ is close to
h because the iterations defined above constitute the
Newton-Raphson steps for minimizing the distance function
$\Delta^{\text{MDI-r}}(c,h) = \sum_{k=1}^{n} [c_k \log(c_k/h_k) - c_k + h_k]$ if $L h_k \le c_k \le U h_k$; $\infty$
otherwise, subject to BCs. Note that in practice, the truncation
factors are not needed separately to compute $g^{(v)*}$.
Appendix A6 gives the computational algorithm for MDI-r.


2.7 METHOD 7 (Logit or Generalized Modified 
Discrimination Information, GMDI) 


This is the last method considered. This well known
method of family II is due to DS. As in the raking method, we
start with $\exp(x_k'\lambda)$ and an inverse logit-type transformation
is used to ensure that the adjustment factor satisfies RRs. The
model for the adjustment factor is given by
$g_k = [(U - 1) + (1 - L)\exp(A x_k'\lambda)]^{-1}\,[L(U - 1) + U(1 - L)\exp(A x_k'\lambda)]$,
where $A = (1 - L)^{-1}(U - 1)^{-1}(U - L)$. This adjustment factor,
unlike other methods, lies necessarily inside the interval
[L, U], i.e., does not take boundary values. As L — 0 and 
U -~ ~, the factor reduces to the familiar inverse logit form, 
exp(x, A)/[1 + exp(x ; A)]. The model parameter A is obtained 
iteratively to meet BCs. Starting with $\lambda^{\text{MCS-u}}$ as $\lambda^{(1)}$ for
iteration 1, we adjust by a smaller order term to obtain $\lambda^{(2)}$
as $\lambda^{(1)} + (X'\Gamma_1 X)^{-1}(\tau_x - t_x^{(1)})$ where $\Gamma_1 = \mathrm{diag}(h_k d_k^{(1)})$,
$d_k^{(1)} = (1 - L)^{-1}(U - 1)^{-1}(g_k^{(1)} - L)(U - g_k^{(1)})$. Further
iterations are done in a similar manner until BCs are met. The
weight vector $c^{\text{GMDI}}$ is close to h in the sense that subject to
BCs, the above iterative process corresponds to the Newton-Raphson
algorithm for minimizing the distance function
$\Delta^{\text{GMDI}}(c,h)$ given by $A^{-1}\sum_{k=1}^{n} h_k [(g_k - L)\log\{(g_k - L)/(1 - L)\}
+ (U - g_k)\log\{(U - g_k)/(U - 1)\}]$. Appendix A7
gives the computational algorithm for GMDI.
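
The logit-type adjustment factor and its Newton-Raphson update can be sketched as below. This Python/NumPy code is an illustration under stated assumptions rather than the algorithm of Appendix A7: the function names, the stopping rule and the toy data are invented for the example, and the diagonal weights $d_k$ are taken to be the derivative of the adjustment factor with respect to $x_k'\lambda$, which is what a Newton step requires.

```python
import numpy as np

def logit_factor(u, L, U):
    """g = [L(U-1) + U(1-L) exp(A u)] / [(U-1) + (1-L) exp(A u)], A = (U-L)/((1-L)(U-1))."""
    A = (U - L) / ((1.0 - L) * (U - 1.0))
    e = np.exp(A * u)
    return (L * (U - 1.0) + U * (1.0 - L) * e) / ((U - 1.0) + (1.0 - L) * e)

def calibrate_logit(X, h, tau_x, L, U, delta=1e-8, v_max=50):
    """GMDI sketch: g stays strictly inside (L, U); iterate until BCs are met."""
    lam = np.linalg.solve(X.T @ (h[:, None] * X), tau_x - X.T @ h)   # lam^(MCS-u)
    for v in range(1, v_max + 1):
        g = logit_factor(X @ lam, L, U)
        c = h * g
        resid = tau_x - X.T @ c
        if np.max(np.abs(resid / tau_x)) <= delta:
            return g, c, v
        d = (g - L) * (U - g) / ((1.0 - L) * (U - 1.0))   # derivative of g in x' lam
        lam = lam + np.linalg.solve(X.T @ ((h * d)[:, None] * X), resid)
    raise RuntimeError("BCs not met within v_max iterations")

# Made-up toy data and fairly loose bounds.
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 3.0], [1.0, 8.0], [1.0, 4.0]])
h = np.array([10.0, 12.0, 8.0, 15.0, 9.0])
tau_x = np.array([60.0, 280.0])
L, U = 0.5, 2.0

g, c, n_iter = calibrate_logit(X, h, tau_x, L, U)
print(n_iter, np.round(g, 4))
```

At $x_k'\lambda = 0$ the factor equals 1 and its derivative equals 1, so the first step coincides with the regression step.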


3. NUMERICAL EXAMPLES 


3.1 Data Description 


We consider application of the seven adjustment methods 
described above to data from the 1990 Statistics Canada's 
Family Expenditure (FAMEX) Survey for the two cities (or 
domains) of Regina and Saskatoon in the province of 
Saskatchewan. Four study variables are considered: annual 
expenditures on owned dwelling for repair and renovation, 
furniture and equipment, ladies' clothing, and men's clothing.
The FAMEX survey is a supplementary survey to the
Canadian Labour Force Survey (LFS) and, therefore, is based
on the LFS design, a multistage stratified cluster sample of
households; see Singh et al. (1990). Samples are drawn
independently from the two cities of Regina and Saskatoon. 
Respectively for the two cities, the numbers of strata are 30 
and 34, and the numbers of primary sampling units (PSUs) 
selected in the sample are 111 and 94. The total numbers of 
sampled households are 321 and 278, while the corresponding 
numbers (n) of individuals are 797 and 712. 


3.2 Benchmark Constraints, Range Restrictions and 
Common Weights per Household 


The number (p) of BCs is four for each domain. They 
correspond to the demographic population counts for the four 
groups: age < 15, age ≥ 15, one person households, and
households with two or more persons. The corresponding 
counts are 40696, 139047, 12746, and 48457 for Regina, and 
42544, 139299, 20628, and 52059 for Saskatoon. Thus, the 
total numbers of households for the two domains are 61203 
and 72687 respectively and the corresponding population 
sizes (N) are 179743 and 181843. The auxiliary x-variables 
here are indicators for the above four groups. 

For Regina, (min, max) of g-weights are obtained as 
(-0.72, 2.74) and (0.19, 3.95) respectively for regression and 
raking methods. It is therefore of interest to make them 
nonnegative for regression and to reduce the high weights for 
raking. Two types of RRs are chosen: one has somewhat 
loose bounds with L = 1/5 and U = 5 and the other has 
somewhat tight bounds with L = 2/5 and U = 5/2. For 
Saskatoon, (min, max) of g-weights are obtained as (0.86, 
1.08) and (0.87, 1.09) respectively for regression and raking 
methods. Note that both methods give g-weights close to 1, 
and therefore there is no real need for RRs. However, for the 
sake of illustration, we choose L = 0.88 and U = 1.12. 

The initial sampling weights or h-weights of individuals in 
the same household are common and equal to the weight of 
that household. It is desirable that after calibration, all 
members of a household have the same c-weights. This can be 
achieved by modifying the X matrix so that x,-values for each 
person in the same household are common and equal to the 
average value for the household, see, e.g., Lemaitre and 
Dufour (1987). We also perform an initial scaling on the 
h-weights so that they add up to N; this is similar to the Hajek
modification of the Horvitz-Thompson estimator. This scaling 
essentially redefines [L, U] to make them meaningful for 
calibration of h-weights. 
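
The household-averaging device can be illustrated with a short sketch. The Python/NumPy code below is not taken from Lemaitre and Dufour (1987); the function name `household_average`, the household identifiers and the weights are made up. It replaces each person's x-values by the household mean and checks that, when h-weights are common within a household, the expansion estimates of the x-totals are unchanged.

```python
import numpy as np

def household_average(X, hh_ids):
    """Replace each person's x-vector by the average x-vector of the household."""
    X_bar = np.empty_like(X, dtype=float)
    for hh in np.unique(hh_ids):
        rows = hh_ids == hh
        X_bar[rows] = X[rows].mean(axis=0)       # same row for every household member
    return X_bar

# Made-up example: 6 persons in 3 households; columns are indicators for
# age < 15 and age >= 15.
hh_ids = np.array([1, 1, 1, 2, 3, 3])
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
              [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
h = np.array([50.0, 50.0, 50.0, 40.0, 60.0, 60.0])   # common within household

X_bar = household_average(X, hh_ids)
print(X_bar)
print(X.T @ h, X_bar.T @ h)                      # identical expansion estimates
```

Because any adjustment factor of the form considered in Section 2 depends on k only through $x_k$ (and, in the restricted case, through truncation or scaling of the same quantity), persons with identical averaged x-values and identical h-weights would receive identical c-weights.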


3.3 Descriptive Measures for Comparison 


For comparing various methods, we consider four types of 
descriptive measures: 
(i) Summary statistics for the distribution of the g-weights, 
(ii) Point estimates for several variables, 
(iii) Estimated precision of the calibration estimates, and 
(iv) Computational burden imposed by each method. 


The first measure consists of a graphical summary using
a box plot for g-weights, and the standard deviation of
g-weights, SD(g), defined as $[N^{-1}\sum_{k=1}^{n} h_k (g_k - 1)^2]^{1/2}$. Note
that the mean of g-weights, i.e., $N^{-1}\sum_k h_k g_k$, is 1 in view of
the fact that $\sum_k h_k = \sum_k c_k = N$, and the SD(g) also equals
$[N^{-1}\sum_k (c_k - h_k)^2/h_k]^{1/2}$, the square root of a normalized
chi-square type distance for measuring closeness between
h- and c-weights. For comparing point estimates and their
precision for estimating the parameter for each variable y of
interest, we compute relative difference (RD) and relative
precision (RP) with respect to the MCS-u weights, i.e.,
relative to the regression estimator. Denoting an estimator
based on c-weights as a c-estimator, we have RD as 
(c-estimator minus regression estimator) divided by the 
regression estimator, and RP as SE(regression estimator) 
divided by SE(c-estimator). Note that for the numerical 
examples under consideration, variances are computed using 
jackknifing by deleting PSUs. Finally, the computational 
burden is expressed in terms of the number of iterations. 
Testing has shown that for all the restricted methods, each 
iteration takes a similar amount of time and hence a good 
comparison of their computational burden is the number of 
iterations required for convergence. 
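
The descriptive measures are simple enough to state as code. The sketch below, in Python with NumPy, uses invented function names and made-up numbers; it assumes the h-weights have already been scaled to add up to N, so that N can be computed as their sum.

```python
import numpy as np

def sd_g(h, g):
    """SD(g) = [N^{-1} sum h_k (g_k - 1)^2]^{1/2}, with N taken as sum(h)."""
    return np.sqrt(np.sum(h * (g - 1.0) ** 2) / np.sum(h))

def chi_square_form(h, c):
    """Equivalent form [N^{-1} sum (c_k - h_k)^2 / h_k]^{1/2}."""
    return np.sqrt(np.sum((c - h) ** 2 / h) / np.sum(h))

def relative_difference(c_estimate, regression_estimate):
    """RD: (c-estimator minus regression estimator) / regression estimator."""
    return (c_estimate - regression_estimate) / regression_estimate

def relative_precision(se_regression, se_c):
    """RP: SE(regression estimator) / SE(c-estimator)."""
    return se_regression / se_c

# Made-up weights and adjustment factors.
h = np.array([10.0, 12.0, 8.0, 15.0, 9.0])
g = np.array([1.2, 0.9, 1.1, 1.0, 0.8])
print(sd_g(h, g), chi_square_form(h, h * g))     # the two forms agree
print(relative_difference(95.0, 100.0), relative_precision(2.0, 2.5))
```

An RP below 1 means the c-estimator has a larger estimated standard error than the regression estimator.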


3.4 Specification of Other Parameters 


We also need to specify some other parameters, namely, $\alpha$,
$\beta$ for SMCS, and $\alpha$, $\eta$ for SM. Empirically, values of
$\alpha = 0.67$, $\eta = 0.9$ and $\beta = 0.8$ are found to perform well. The
tolerance levels $\epsilon$ for family I and $\delta$ for family II are set at
0.01, and $v_{\max}$ is set at 10.


3.5 Results: A Descriptive Analysis 


3.5.1 Distribution of g-weights 


We first consider the Regina data. Figure 1 gives a box 
plot of the distribution of g-weights with L = 0.4 and U = 2.5. 
Note that there are negative g-weights (and hence negative 
c-weights) for MCS-u and large g-weights (which produce 
large c-weights) for the MDI-u method. For MCS-u, the 
fraction of g-weights < 0 is 4.9%, the fraction < 0.4 is 5.9%, 
the fraction above 2.5 is 1.25% while above 3.5 is 0%. For 
MDI-u, the fraction below 0.4 is 4.9%, the fraction > 2.5 is 
4.3% and above 3.5 is 1.25%. Thus, both methods yield 
c-weights which are out of bounds with respect to RRs with 
tight bounds. The range restricted methods all have median 
g-weights between 0.65 and 0.75; the SMCS g-weights show, 
however, the most clustering around the median. Table 1 
shows that under loose bounds, the SD(g) for each restricted 
method is slightly higher (about 7%) than the regression 
method, but for tight bounds, the difference increases to about 
15% for family I and about 10% for family II.

Now for the Saskatoon data, Figure 2 gives a box plot of 
g-weights with L = 0.88 and U = 1.12. For both regression 
and raking methods, about 5.6% are below L and 0% are 
above U. All methods have similar interquartile range for 
g-weights with medians slightly above 1. Also it is seen from 
Table 1 that SD(g) for all the methods (restricted and 
unrestricted) are about the same and quite small. 


Table 1
Number of Iterations and SD(g)
($\alpha = .67$, $\beta = .8$, $\eta = .9$, $\epsilon = \delta = .01$, $v_{\max} = 10$)

                           Regina                               Saskatoon
             L = 0.2, U = 5.0      L = 0.4, U = 2.5      L = 0.88, U = 1.12
             (Loose bounds)        (Tight bounds)
Method       Iterations  SD(g)     Iterations  SD(g)     Iterations  SD(g)
Family I
  SMCS           2       0.647         3       0.702         2       0.071
  SM             2       0.636         4       0.689         2       0.070
Family II
  MCS-r          2       0.628         3       0.654         1       0.069
  MDI-r          3       0.642         3       0.660         1       0.069
  GMDI           3       0.640         3       0.659         2       0.069

Note: For the unrestricted (or no bounds) case, the number of iterations and
SD(g) are: for Regina MCS-u and MDI-u are (1, 0.599) and (3, 0.647)
respectively; for Saskatoon MCS-u and MDI-u are (1, 0.070) and
(1, 0.069) respectively.


3.5.2 Relative Difference of Point Estimates 


Tables 2(a) and (b) show that for Regina, under loose 
bounds RD is small for all the methods for each of the 
variables. In fact, it is negligible except for the variable 
“owned dwelling” for which it is generally under 4%. 
However, under tight bounds, it increases somewhat but 
remains small with values ranging between 1% and 5%. For 
Saskatoon (Table 2c), under the given bounds RD is 
negligible for all the methods. 


3.5.3 Estimated Relative Precision of Estimates 


For Regina, under loose bounds, RP is generally within 5%
(of the precision of the regression estimator) for all methods
and all variables except for MDI-r with the variable “ladies'
clothing” for which it is lower by 9%. However, under tight
bounds, RP varies more and is now generally within 9%
except for SMCS and SM with the variable “men's clothing”
(RP is lower by 20%) and MDI-r for the variable “ladies'
clothing” for which RP is lower by 11%. For Saskatoon
(Table 2c), under the chosen bounds RP is close to 1 for all
cases.


3.5.4 Computational Burden 


For Regina (Table 1), under loose bounds each method
takes two or three iterations. As the bounds are tightened,
most of the methods require more iterations to converge. To
see how tightly the bounds could be squeezed before
encountering convergence problems, three more sets of
bounds were used with [L, U] = [0.425, 2.35], [0.45, 2.22]
and [0.475, 2.11]. These results are not shown in the table.
With $v_{\max}$ as 10, the SM method does not converge for [0.425,
2.35]. The SMCS and GMDI methods do not converge
for [0.45, 2.22] and the MCS-r and MDI-r finally have
convergence problems for [0.475, 2.11]. For Saskatoon
(Table 1), under the chosen bounds each method takes only
one or two iterations. With $v_{\max}$ as 10, as bounds are tightened
to [0.92, 1.08], SM does not converge. At [0.93, 1.07],
SMCS, MCS-r, and MDI-r have convergence problems, and
finally at [0.96, 1.06], GMDI has problems.


Figure 1. Box Plot: g-weights for Regina FAMEX data (L = 0.4, U = 2.5)

Figure 2. Box Plot: g-weights for Saskatoon FAMEX data (L = 0.88, U = 1.12)


Table 2a
Difference in Point Estimates and Precision Relative to
Regression Estimator ($\alpha = .67$, $\beta = .8$, $\eta = .9$, $\epsilon = \delta = .01$, $v_{\max} = 10$)
Regina: L = 0.2, U = 5.0 (Loose Bounds)

               Owned Dwelling        Furniture\Equipment
               RD        RP          RD        RP
Family I
  SMCS       -0.043     1.047       0.001     1.032
  SM         -0.036     1.032      -0.002     1.040
Family II
  MCS-r      -0.032     1.035       0.002     1.034
  MDI-r      -0.033     0.991      -0.008     1.037
  GMDI       -0.037     0.999      -0.004     1.041

               Ladies' Clothing      Men's Clothing
               RD        RP          RD        RP
Family I
  SMCS        0.015     0.931       0.009     0.952
  SM          0.010     0.951       0.006     0.968
Family II
  MCS-r       0.011     0.950       0.008     0.964
  MDI-r       0.007     0.911      -0.001     0.961
  GMDI        0.009     0.940       0.002     0.968

Notes:
1. RD and RP denote respectively “relative difference” and “relative
   precision”.
2. For the unrestricted (or no bounds) case, the corresponding measures for
   the raking (MDI-u) method relative to regression are (-0.034, 1.005),
   (-0.008, 1.049), (0.004, 0.968) and (0.002, 0.980) for the four study
   variables respectively.


Table 2b
Difference in Point Estimates and Precision Relative to
Regression Estimator ($\alpha = .67$, $\beta = .8$, $\eta = .9$, $\epsilon = \delta = .01$, $v_{\max} = 10$)
Regina: L = 0.4, U = 2.5 (Tight Bounds)

               Owned Dwelling        Furniture\Equipment
               RD        RP          RD        RP
Family I
  SMCS       -0.056     1.100       0.012     1.000
  SM         -0.055     0.992       0.017     0.919
Family II
  MCS-r      -0.048     1.073       0.008     0.952
  MDI-r      -0.045     1.087       0.012     0.965
  GMDI       -0.047     1.077       0.009     1.006

               Ladies' Clothing      Men's Clothing
               RD        RP          RD        RP
Family I
  SMCS        0.024     0.917       0.038     0.808
  SM          0.025     0.917       0.024     0.801
Family II
  MCS-r       0.020     0.904       0.012     0.922
  MDI-r       0.025     0.888       0.012     0.922
  GMDI        0.021     0.938       0.018     0.917

Note: During the jackknifing procedure, the SM method failed to converge
in ten iterations for four pseudo-replicates (out of a total of 111).


Table 2c
Difference in Point Estimates and Precision Relative to
Regression Estimator ($\alpha = .67$, $\beta = .8$, $\eta = .9$, $\epsilon = \delta = .01$, $v_{\max} = 10$)
Saskatoon: L = 0.88, U = 1.12

               Owned Dwelling        Furniture\Equipment
               RD        RP          RD        RP
Family I
  SMCS       -0.001     1.001      -0.001     0.999
  SM         -0.000     1.001      -0.000     0.999
Family II
  MCS-r       0.000     0.999       0.000     1.000
  MDI-r       0.002     0.997       0.002     0.994
  GMDI       -0.000     1.007      -0.000     0.990

               Ladies' Clothing      Men's Clothing
               RD        RP          RD        RP
Family I
  SMCS        0.000     1.013      -0.001     0.999
  SM         -0.000     1.002      -0.000     0.998
Family II
  MCS-r       0.000     0.990       0.000     0.994
  MDI-r       0.002     1.001       0.002     0.983
  GMDI        0.000     0.977      -0.000     0.990

Notes:
1. For the unrestricted (or no bounds) case, the corresponding measures for
   the raking (MDI-u) method relative to regression are (0.002, 1.000),
   (0.002, 1.000), (0.002, 1.002) and (0.002, 0.995) for the four study
   variables respectively.
2. During the jackknifing procedure, the SM method failed to converge in
   ten iterations for two pseudo-replicates (out of a total of 94).


4. DISCUSSION 


Although the numerical results for a few variables for two
different domains considered in this paper are too limited to
draw general conclusions, the results based on a descriptive
analysis are nevertheless interesting and may provide some
indications which might be useful in practice. These can be
summarized in the following observations. For loose bounds,
all the restricted methods seem to perform almost on par with
the regression method. However, for tight bounds, there seems
to be a difference in point estimates and especially in
estimated precision. This observation clearly needs further
study in light of the fact that all methods are asymptotically
equivalent to the regression method. A simulation study in
this regard would be desirable. The recent study of Stukel,
Hidiroglou, and Sarndal (1996) sheds some light on this issue.
Moreover, for tight bounds, there may not be convergence
under the specified number of iterations even if a solution
exists. This problem may be more apparent in dealing with 
jackknife replicates. Therefore, caution should be exercised 
in choosing the maximum number of iterations for tight 
bounds. Finally, in practice, it is possible that even with 
minimal requirements on BCs and RRs, none of the cali- 
bration estimators converge within a reasonable number of 
iterations. In this situation, it would be of interest to investi- 
gate whether the (asymptotic) design consistency of calibra- 
tion estimators could be preserved while allowing deviation 
from BCs. The idea of using ridge regression by Bardsley and 
Chambers (1984), although not in the design-based context, 
may be useful for this purpose. This problem is currently 
being investigated in collaboration with J.N.K. Rao. 


APPENDIX 


Here we provide computational algorithms for all seven 
methods of weight adjustment. These algorithms were used to 
write computer programs in GAUSS software for the 
numerical examples presented in this paper. 

In all the methods, some form of the following expression,
denoted by the n-vector $f^{(v)}$, is used repeatedly for computing
$c_k^{(v)}$ for $v = 1, 2, \ldots$:


$$f^{(v)} = X (X' \Gamma_{v-1} X)^{-1} (\tau_x - t_x^{(v-1)}) \tag{1}$$


where $\Gamma_{v-1}$ is an $n \times n$ diagonal matrix defined below in the
algorithm for each method. Initially $\Gamma_0 = \mathrm{diag}(h_k)$ and
$t_x^{(0)} = \sum_k x_k h_k$.


Al. METHOD 1 (MCS-u) 


The solution is non-iterative and is given in two steps as 
follows. 
(i) Compute $f_k$, $k = 1$ to $n$, from (1) by setting $\Gamma_{v-1} = \Gamma_0$.
(ii) Compute $g_k$ as $1 + f_k$ and then $c_k^{MCS-u}$ as $h_k g_k$.


A2. METHOD 2 (MDI-u) 


The solution is obtained iteratively by the following steps
for $v = 1, 2, \ldots$

(i) Set the tolerance level $\delta > 0$ for meeting BCs at some
small value.

(ii) For the v-th iteration, compute $f_k^{(v)}$, $k = 1$ to $n$, from
(1) by setting $\Gamma_{v-1} = \mathrm{diag}(c_k^{(v-1)})$.

(iii) Compute $g_k^{(v)}$ as $g_k^{(v-1)} \exp(f_k^{(v)})$, where
$g_k^{(0)} = 1$, and then $c_k^{(v)}$ from $h_k g_k^{(v)}$.

(iv) Repeat steps (ii)-(iii) until the BCs are met up to the
tolerance level $\delta$ or the number of iterations is at its
maximum, $v_{max}$. The last iteration gives $c_k^{MDI-u}$.


A3. METHOD 3 (SMCS) 


The solution is obtained iteratively as follows. 
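(i) Set the RRs, i.e., choose $L$ and $U$, $L < 1 < U$.
(ii) Set the tolerance level $\epsilon > 0$ at a small value for
meeting the RRs.

(iii) Choose a parameter $\alpha$ between 0 and 1 (e.g., 2/3) and
set $L' = \alpha L + 1 - \alpha$, $U' = \alpha U + 1 - \alpha$. The default
value of 1 for $\alpha$ is also allowed, in which case $L' = L$,
$U' = U$.

(iv) For the v-th iteration with $g_k^{(0)} = 1$, define $\xi_k^{(v-1)} =
(g_k^{(v-1)} - 1)/(L' - 1)$ if $g_k^{(v-1)} < 1$; $(g_k^{(v-1)} - 1)/(U' - 1)$
otherwise.

(v) Choose another parameter $\beta$ between 0 and 1 (e.g.,
4/5). Set $\phi_k^{(v-1)} = 1$ if $\xi_k^{(v-1)} < 1/2$; $1 - \beta(\xi_k^{(v-1)} - 1/2)^2$
if $1/2 \le \xi_k^{(v-1)} < 1$; $(1 - \beta/4)/\xi_k^{(v-1)}$ if $\xi_k^{(v-1)} \ge 1$, and then
define, for $v = 1, 2, \ldots$, $q_k^{(v-1)} = \prod_{i=0}^{v-1} \phi_k^{(i)}$, where
$\phi_k^{(0)} = 1$. Note the compounding of $\phi$-factors in defining
$q_k^{(v-1)}$.

(vi) Compute $f_k^{(v)}$ from (1) by setting $\Gamma_{v-1} = \mathrm{diag}(h_k q_k^{(v-1)})$
and $t_x^{(v-1)} = t_x^{(0)}$ for all $v$.

(vii) Find $g_k^{(v)}$ as $1 + q_k^{(v-1)} f_k^{(v)}$ and then $c_k^{(v)}$ as $h_k g_k^{(v)}$.

(viii) Repeat steps (iv)-(vii) until the RRs are met up to the
tolerance level $\epsilon$ or $v = v_{max}$. The last iteration gives
$c_k^{SMCS}$. The value of $\beta$ should remain the same at each
iteration.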


A4. METHOD 4 (SM) 


This method consists of the following steps performed 
iteratively. 
(i)-(ii) Same as in Method 3. 
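(iii) Choose parameters $\alpha$, $\eta$, $0 < \alpha < \eta < 1$ (e.g., $\alpha = 2/3$,
$\eta = 9/10$) and define
$L' = \alpha L + (1 - \alpha)$, $U' = \alpha U + (1 - \alpha)$,
$L'' = \eta L + (1 - \eta)$, $U'' = \eta U + (1 - \eta)$.
The default option for $\alpha$ and $\eta$ is 1, in which case
$L' = L'' = L$, $U' = U'' = U$.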
(iv) eee) The i from tee bs th eaeatss iS shrunk 
to obtain ral essai toc,’ =L*h, if ce te L*h,; 


Uhgsife->-Uhesey" jihenticea boat 0, 
eOI= EO” =f. i. 

(v) LUTON Find f;, from (1) by setting 
| Ree Se = diag(cy ) and Re Di 

(vi) Compute eC cashes (1 “fe where g’’ * = 

c’)"/h, and then c” from h,g%). 

(vii) Repeat steps (iv)-(vi) until the RRs are satisfied up to
tolerance $\epsilon$ or $v = v_{max}$. The last iteration gives $c_k^{SM}$.


A5. METHOD 5 (MCS-r) 


The iterative algorithm consists of the following steps. 
(i) Set L and U. 
(ii) Set the tolerance level $\delta > 0$ for meeting the BCs.
(iii) Compute f yu from (1) by setting I’,_, = diag(h, a\ a) 


where a’) = 1 if g{» was baimeated Ol ce U, and 
0 otherwise. 
(iv) Set $g_k^{(0)} = 1$ and compute $g_k^{(v)}$ as $g_k^{(v-1)} + f_k^{(v)}$ if
$L < g_k^{(v)} < U$; otherwise truncate $g_k^{(v)}$ to $L$ or $U$ as the
case may be, and then $c_k^{(v)}$ as $h_k g_k^{(v)}$.

(v) Repeat steps (iii)-(iv) until BCs are met at the
tolerance level $\delta$ or $v = v_{max}$. The last iteration gives
$c_k^{MCS-r}$.


A6. METHOD 6 (MDI-r)


The iterative algorithm consists of the following steps. 

(i)-(ii) Same as in Method 5.

(iii) Compute $f_k^{(v)}$ from (1) by setting $\Gamma_{v-1} =
\mathrm{diag}(c_k^{(v-1)} a_k^{(v-1)})$ where $a_k^{(v-1)}$ is defined as in Step
(iii) of Method 5.

(iv) Set $g_k^{(0)} = 1$ and compute $g_k^{(v)} = g_k^{(v-1)} \exp(f_k^{(v)})$ if
$L < g_k^{(v)} < U$; otherwise truncate $g_k^{(v)}$ to $L$ or $U$ as the
case may be, and then $c_k^{(v)}$ as $h_k g_k^{(v)}$.

(v) Repeat steps (iii)-(iv) until BCs are satisfied at
tolerance $\delta$ or $v = v_{max}$. The last iteration gives $c_k^{MDI-r}$.


A7. METHOD 7 (GMDI) 


The iterative algorithm consists of the following steps. 

(i)-(ii) Same as in Method 5. 

(iii) Compute $f_k^{(v)}$ from (1) by setting $\Gamma_{v-1} =
\mathrm{diag}(h_k d_k^{(v-1)})$ where $d_k^{(v-1)}$ is analogous to $d_k$ of
Section 2.7.

(iv) Using $x_k' \lambda^{(v)} = x_k' \lambda^{(v-1)} + f_k^{(v)}$, find $g_k^{(v)}$ from the
formula for $g_k$ given in Section 2.7, and then $c_k^{(v)}$ as
$h_k g_k^{(v)}$.

(v) Repeat steps (iii)-(iv) until BCs are met at tolerance $\delta$
or $v = v_{max}$. The last iteration gives $c_k^{GMDI}$.


Variance Estimation for Calibration Estimators:
A Comparison of Jackknifing Versus Taylor Linearization 


DIANA M. STUKEL, MICHAEL A. HIDIROGLOU and CARL-ERIK SARNDAL¹


ABSTRACT 


The use of auxiliary information in estimation procedures in complex surveys, such as Statistics Canada’s Labour Force 
Survey, is becoming increasingly sophisticated. In the past, regression and raking ratio estimation were the commonly used 
procedures for incorporating auxiliary data into the estimation process. However, the weights associated with these 
estimators could be negative or highly positive. Recent theoretical developments by Deville and Sarndal (1992) in the
construction of “restricted” weights, which can be forced to be positive and upwardly bounded, have led us to study the
properties of the resulting estimators. In this paper, we investigate the properties of a number of such weight generating
procedures, as well as their corresponding estimated variances. In particular, two variance estimation procedures are
investigated via a Monte Carlo simulation study based on Labour Force Survey data; they are Jackknifing and Taylor
Linearization. The conclusion is that the bias of both the point estimators and the variance estimators is minimal, even under
severe “restricting” of the final weights.


KEY WORDS: Auxiliary information; Raking ratio estimators; Regression estimators; Restricted weighting. 


1. INTRODUCTION 


Auxiliary information has many uses in survey sampling. 
One typical use is its incorporation at the estimation stage 
through the use of regression estimators or raking ratio esti- 
mators. For these estimators, a unit’s sampling weight is 
multiplied by an adjustment factor to produce the final 
weight. A well-known shortcoming associated with the 
regression estimator is that some of the adjustment factors 
may be negative, resulting in negative final weights. On the 
other hand, for the raking ratio estimator, some adjustment 
factors may be very large and positive, resulting in unduly 
large final weights. These shortcomings can be overcome by 
considering a family of estimators, known as “calibration 
estimators”. Developed by Deville and Sarndal (1992), the 
estimators in this family incorporate auxiliary information, 
and in certain cases, non-negative weights can be ensured by 
prespecifying lower and upper bounds on the weights. These 
“calibration” weights are obtained by minimizing functions 
which measure the distances between original sampling 
weights and final calibrated weights, while respecting a set of 
benchmarking constraints. Huang and Fuller (1978) and 
Singh and Mohl (1996) have developed similar estimators 
which maintain the above properties. Ordinarily, there are 
very small differences between the point estimates cor- 
responding to the various distance functions. 

Historically, Statistics Canada’s Labour Force Survey 
(LFS) has used, at different points in time, both the Taylor 
and Jackknife variance estimation techniques in tandem with 
regression and raking ratio estimators. Recently, the LFS has
also allowed for the option of using other calibration
estimators in addition to the previously available regression
estimator, to eliminate the problem of potential negative
weights. It is therefore of interest to investigate the behaviour
of these point estimators and their corresponding Taylor and 
Jackknife variance estimators, particularly for those esti- 
mators that allow bounding on the weights. Therein lies the 
main focus of this paper. Now, both the Taylor and the 
Jackknife have their advantages. The Taylor method is com- 
putationally much less intensive than the Jackknife method, 
but requires working out new expressions for each different 
parameter that is considered; this is particularly a burden in 
multipurpose surveys where many different parameters may 
be of interest. On the other hand, for the Jackknife method, 
cumbersome variance expressions need not be derived for 
each new parameter; only the functional form of the point 
estimator itself is required. 

The paper is structured as follows: section 2 provides the 
theoretical underpinnings of calibration estimation and intro- 
duces a family of related distance functions. In section 3, 
variances for calibration estimators are discussed. Section 4 
provides the results of a Monte Carlo simulation study, in 
which the bias of both the point estimators and their cor- 
responding Taylor and Jackknife variance estimators (relative 
to a “true” variance) is tracked, for a variety of distance func- 
tions from calibration theory. In section 5, some concluding 
remarks are made. 


2. DISTANCE FUNCTIONS AND CALIBRATION 
ESTIMATORS 


We begin by introducing the basic idea behind calibration 
estimation. Let $U = \{1, \ldots, k, \ldots, N\}$ denote the index set for
the $N$ units of a finite population. In survey sampling,
one is often interested in estimating parameters of a finite
population such as totals, means and ratios. For the sake of
simplicity, we will focus on totals, although the ideas
presented in this paper may easily be extended to include
other parameters. Thus, suppose the objective is to estimate
the population total $Y = \sum_{k \in U} y_k$, where $y_k$ is the value of $y$,
the variable of interest, for the $k$-th population unit.

A probability sample $s$ is drawn from $U$ by a given
sampling design which induces the inclusion probabilities
$\pi_k = P(k \in s)$. These are assumed known and positive. Let
$a_k = 1/\pi_k$ be the sampling weight associated with the $k$-th unit.
Finally, let the auxiliary information be specified in the form 
of known population totals of one or more auxiliary variables. 

An elementary estimator of Y is the Horvitz-Thompson 
(HT) estimator: 


$$\hat{Y}_{HT} = \sum_{k \in s} a_k y_k.$$


The HT estimator possibly but not necessarily (depending 
on the sampling design) incorporates auxiliary information at 
the design stage only; what is sought is an improved estimator 
which incorporates the auxiliary information at the estimation 
stage, as well. The incorporation of auxiliary information can 
be reflected in the creation of new weights, denoted by $w_k$,
$k \in s$. The new estimator is then of the form:


$$\hat{Y}_w = \sum_{k \in s} w_k y_k. \tag{2.1}$$


The approach of Deville and Sarndal (1992) and Deville,
Sarndal and Sautory (1993) involves determining these new
weights $\{w_k : k \in s\}$ by making them as close as possible to the
original sampling weights $\{a_k : k \in s\}$ according to a specified
distance function. Constraints placed on the new weights are
such that, when applied to each of the auxiliary variables, the
known population total $X$ is reproduced. That is,


$$\sum_{k \in s} w_k x_k = X \tag{2.2}$$

is required to hold, leading to a problem in constrained
minimization. Here $x_k = (x_{k1}, \ldots, x_{kp})'$ is a vector of length
$p$ containing the values of the auxiliary variables for the $k$-th
individual, and the auxiliary information available from an
external source is summarized by the known vector total
$X = \sum_{k \in U} x_k$.

We denote the distance from $w_k$ to $a_k$ by $F^*(w_k, a_k)$. Deville
and Sarndal (1992) limit their discussions to distance
functions of the form $F^*(w_k, a_k) = a_k c_k F(w_k/a_k)$ where
$w_k/a_k = g_k$, the ratio of the final calibrated weight to original
sampling weight, is called the “g-factor”. Here $c_k$ is a known
positive weight unrelated to $a_k$; the uniform weighting $c_k = 1$
is often used in applications. Note that equation (2.1) can
alternatively be written as:


$$\hat{Y}_w = \sum_{k \in s} a_k g_k y_k.$$


It is assumed that $F$ is non-negative and convex, and that
$F(1) = 0$, implying that when $w_k = a_k$ the distance between the
weights is zero. Moreover, it is required that $F'$ is continuous,
one-to-one, and that $F'(1) = 0$ and $F''(1) > 0$, which makes
$w_k = a_k$ a local minimum. (See Deville, Sarndal and Sautory
1993.) The total distance, $\sum_{k \in s} a_k c_k F(w_k/a_k)$, is minimized
subject to the constraint (2.2). That is,


$$\sum_{k \in s} a_k c_k F(w_k/a_k) - \lambda' \left( \sum_{k \in s} w_k x_k - X \right)$$


is minimized with respect to the $w_k$, where $\lambda$ is a $p$-vector of
Lagrange multipliers. Differentiating with respect to $w_k$,
equating to zero, and solving for $w_k$ leads to the calibrated
weights $w_k = a_k g_k = a_k g(\lambda' x_k / c_k)$ where $g$ is the inverse
function of $f$ and $f(z) = dF(z)/dz$. To compute $w_k$, one must
first obtain $\lambda$ as the solution of the calibration equation
implied by (2.2), namely,


$$\sum_{k \in s} a_k \, g(\lambda' x_k / c_k) \, x_k = X. \tag{2.3}$$

The solution of this (possibly) nonlinear system of p 
equations in p unknowns may require the use of some itera- 
tive procedure, such as the Newton-Raphson method. 
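
As a rough illustration of how (2.3) might be solved, the sketch below applies Newton-Raphson with the raking choice $g(u) = \exp(u)$ and $c_k = 1$. The function names (`solve_linear`, `calibrate`), the stopping rule and the toy data are assumptions made for this example; they are not taken from the paper or from any particular software.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def calibrate(a, x, totals, g=math.exp, dg=math.exp, tol=1e-10, max_iter=50):
    """Newton-Raphson for sum_k a_k g(lam'x_k) x_k = X, assuming c_k = 1.

    a: sampling weights; x: auxiliary vectors; totals: known totals X.
    g and dg are the g-function and its derivative (raking by default).
    """
    p = len(totals)
    lam = [0.0] * p

    def scores(lam_vec):
        return [sum(l * v for l, v in zip(lam_vec, xk)) for xk in x]

    for _ in range(max_iter):
        u = scores(lam)
        # Residual of the calibration equation at the current lambda.
        resid = [totals[j] - sum(ak * g(uk) * xk[j]
                                 for ak, uk, xk in zip(a, u, x))
                 for j in range(p)]
        if max(abs(r) for r in resid) < tol:
            break
        # Jacobian: sum_k a_k g'(u_k) x_k x_k'.
        jac = [[sum(ak * dg(uk) * xk[i] * xk[j]
                    for ak, uk, xk in zip(a, u, x))
                for j in range(p)] for i in range(p)]
        lam = [l + s for l, s in zip(lam, solve_linear(jac, resid))]
    return [ak * g(uk) for ak, uk in zip(a, scores(lam))]


# Toy data (hypothetical): intercept plus one auxiliary variable.
weights = calibrate([10.0] * 4, [(1, 1), (1, 2), (1, 3), (1, 4)], [44.0, 120.0])
```

With the exponential g-function every calibrated weight stays positive, whatever the totals.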

A number of distance functions are considered by Deville 
and Sarndal (1992), Huang and Fuller (1978) and Singh and 
Mohl (1996). Two important distance functions which we 
first discuss are the Generalized Least Squares (GLS) distance 
function and the Raking Ratio (RR) distance function, both 
given in Deville and Sarndal (1992).


The GLS distance function is defined by: 


$$F^*(w_k, a_k) = F^*_{GLS}(w_k, a_k) = c_k (w_k - a_k)^2 / a_k = a_k c_k (w_k/a_k - 1)^2. \tag{2.4}$$


It generates the well-known generalized regression 
estimator (GREG), which encompasses as special cases the 
ratio estimator, the simple regression estimator, and the 
simple post-stratified estimator, among others. It follows from 
(2.3) that the calibrated weights corresponding to the GLS 
distance function are: 


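$$w_k = a_k g_k = a_k \left[ 1 + (X - \hat{X}_{HT})' \left( \sum_{j \in s} a_j x_j x_j' / c_j \right)^{-1} x_k / c_k \right]$$


where $\hat{X}_{HT} = \sum_{k \in s} a_k x_k$ is the HT estimator of $X$. The
corresponding estimator of $Y$ can be written in the usual
regression estimator form as


$$\hat{Y}_{w(GREG)} = \hat{Y}_{HT} + (X - \hat{X}_{HT})' \hat{B} \tag{2.5}$$

where

$$\hat{B} = \left( \sum_{k \in s} a_k x_k x_k' / c_k \right)^{-1} \sum_{k \in s} a_k x_k y_k / c_k. \tag{2.6}$$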


Survey Methodology, December 1996 


Thus, the regression estimator can be thought of as the HT 
estimator plus an adjustment term. A drawback of the GLS 
distance function is that it may give rise to negative weights, 
particularly if the system is overconstrained. In practice, 
negative weights are rare; however, it is desirable to eliminate 
them entirely since it may be difficult to give them any 
meaningful interpretation. 
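
The closed-form GLS weights can be sketched in a few lines. The example below is only an illustration: the helper names (`solve_linear`, `greg_weights`) and the toy data are assumptions, and the second call is chosen deliberately so that one weight turns negative.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def greg_weights(a, x, totals, c=None):
    """GLS calibration weights a_k [1 + (X - X_HT)' T^{-1} x_k / c_k]."""
    p = len(totals)
    c = c or [1.0] * len(a)
    # T = sum_k a_k x_k x_k' / c_k
    t = [[sum(ak * xk[i] * xk[j] / ck for ak, xk, ck in zip(a, x, c))
          for j in range(p)] for i in range(p)]
    # X minus the HT estimate of X
    diff = [totals[j] - sum(ak * xk[j] for ak, xk in zip(a, x))
            for j in range(p)]
    coef = solve_linear(t, diff)
    return [ak * (1.0 + sum(b * v for b, v in zip(coef, xk)) / ck)
            for ak, xk, ck in zip(a, x, c)]


# Hypothetical sample: intercept plus one auxiliary variable.
a = [10.0] * 4
x = [(1, 1), (1, 2), (1, 3), (1, 4)]
mild = greg_weights(a, x, [44.0, 120.0])     # all weights positive
extreme = greg_weights(a, x, [40.0, 60.0])   # last weight is negative
```

Both sets of weights reproduce their control totals exactly; only the second set shows the negative-weight drawback described above.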


The Raking Ratio (RR) distance function is defined by: 
$$F^*(w_k, a_k) = F^*_{RR}(w_k, a_k) = c_k [w_k \log(w_k/a_k) - w_k + a_k] = a_k c_k [(w_k/a_k) \log(w_k/a_k) - (w_k/a_k) + 1]. \tag{2.7}$$


Solving for g-factors using the RR distance function and 
the constraint defined by equation (2.3) can be shown to be 
equivalent to using the Iterative Proportional Fitting (IPF) 
algorithm of Deming and Stephan (1940) when calibrating on 
known marginals of frequency tables of dimension two or 
higher. Unlike the GLS distance function, which has a closed 
form solution, the calibration equations for the RR distance 
function can only be solved iteratively. Computer software 
exists for this purpose; for example, the CALMAR software 
(see Deville, Sarndal and Sautory 1993) solves the calibration 
equations for the RR distance function using the Newton- 
Raphson method, rather than the IPF algorithm originally 
proposed by Deming and Stephan. The RR distance function 
always ensures positive weights; however, it also has the 
undesirable property that some of the resulting calibration 
weights can be excessively large. 
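
For the two-way case, the Iterative Proportional Fitting idea can be sketched as follows. This is a minimal illustration on an assumed 2 x 2 table of weighted counts with assumed margins; the function name `ipf` and the stopping rule are choices made for the example, not details from Deming and Stephan (1940) or from CALMAR.

```python
def ipf(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Alternately rescale rows and columns until both margins match."""
    t = [row[:] for row in table]
    for _ in range(max_iter):
        # Row step: scale each row to its known marginal total.
        for i, target in enumerate(row_targets):
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        # Column step: scale each column to its known marginal total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in t)
            for row in t:
                row[j] *= target / s
        # Columns now match exactly; stop once the rows match as well.
        if max(abs(sum(row) - r) for row, r in zip(t, row_targets)) < tol:
            break
    return t


# Hypothetical weighted sample counts and known margins.
start = [[10.0, 20.0], [30.0, 40.0]]
fitted = ipf(start, row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])
# Cell-level g-factors: always positive, but not bounded above.
g_factors = [[f / s for f, s in zip(fr, sr)] for fr, sr in zip(fitted, start)]
```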

Neither the possibility of negative weights produced by the
GLS distance function nor the possibility of large positive
weights produced by the RR distance function is desirable.
One can define restricted distance functions whereby the
range of the resulting weights $w_k$ is limited. This is achieved
by imposing restrictions on the distance function $F(w_k/a_k)$ in
such a way that the g-factors $g_k = w_k/a_k$ are bounded within
a prespecified interval. To this end, one can specify a lower
bound $L$ and an upper bound $U$, such that $L < 1 < U$. To
guarantee positive weights, one would choose $L > 0$. Now,
Deville and Sarndal (1992) define restricted versions of the 
two distance functions given above; they are: the Restricted 
GLS (RGLS) distance function and the Restricted Raking 
Ratio (RRR) or Logit distance function. Two other methods 
of restricting final weights are proposed by Huang and Fuller 
(1978) and Singh and Mohl (1996). All four restricted 
distance functions are considered in this paper; they are also 
discussed in detail in Singh and Mohl (1996), but from a 
different perspective. 


The Restricted GLS distance function is defined by: 
$$F^*(w_k, a_k) = F^*_{RGLS}(w_k, a_k) = \begin{cases} c_k (w_k - a_k)^2 / a_k & \text{if } L < w_k/a_k < U \\ \infty & \text{otherwise.} \end{cases} \tag{2.8}$$

The Restricted RR (or Logit) distance function is defined
by:


$$F^*(w_k, a_k) = F^*_{RRR}(w_k, a_k) = \begin{cases} A^{-1} a_k c_k \left[ (w_k/a_k - L) \log \dfrac{w_k/a_k - L}{1 - L} + (U - w_k/a_k) \log \dfrac{U - w_k/a_k}{U - 1} \right] & \text{if } L < w_k/a_k < U \\ \infty & \text{otherwise} \end{cases} \tag{2.9}$$


where $A = (U - L)/\{(1 - L)(U - 1)\}$. The specification $L = 0$,
$U = \infty$ gives the RR distance function. It is easy to show that
the Restricted GLS and Restricted RR distance functions
share the property that the corresponding weights $w_k$ satisfy
$L < w_k/a_k < U$.
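
To see why the logit form keeps the g-factors inside the bounds, one can invert $f = dF/dz$ for the Restricted RR distance, which gives $g(u) = [L(U-1) + U(1-L)e^{Au}] / [(U-1) + (1-L)e^{Au}]$. The sketch below evaluates this function; the name `logit_g` and the overflow-safe branching are choices made for the illustration.

```python
import math


def logit_g(u, lower, upper):
    """g-function of the Restricted RR (logit) distance, lower < 1 < upper."""
    a_const = (upper - lower) / ((1.0 - lower) * (upper - 1.0))
    z = a_const * u
    if z <= 0:
        e = math.exp(z)              # cannot overflow for z <= 0
        num = lower * (upper - 1.0) + upper * (1.0 - lower) * e
        den = (upper - 1.0) + (1.0 - lower) * e
    else:
        e = math.exp(-z)             # numerator and denominator divided by exp(z)
        num = lower * (upper - 1.0) * e + upper * (1.0 - lower)
        den = (upper - 1.0) * e + (1.0 - lower)
    return num / den


# Bounds L = .4, U = 2: g(0) = 1, and g tends to L or U in the tails.
values = [logit_g(u, 0.4, 2.0) for u in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```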

Now, Huang and Fuller (1978) propose a method for 
adjusting regression weights such that the calibration 
constraints given by equation (2.2) are satisfied and such that 
the g-factors are restricted to lie close to one. Singh and Mohl 
(1996) show that their method can be written in terms of 
minimizing a distance function which changes from iteration 
to iteration. Singh and Mohl also modify the original method 
to allow for arbitrary restrictions on the g-factors, similar to 
the restricted distance functions above, and show that the 
estimator resulting from the modified distance function is 
asymptotically equivalent to the regression estimator. The 
Modified Huang-Fuller (MHF) distance function is given by: 


(v-1) 


F*(w,””, a,) = eC »a,) 
Ay Gena) ag) vem... (2u0) 


-])* - 4 . 
where a we ae eee with Gee = 1 and where v is 


the iteration number. Here,


$$\phi_k^{(v-1)} = \begin{cases} 1 & \text{if } \xi_k^{(v-1)} < 1/2 \\ 1 - \beta (\xi_k^{(v-1)} - 1/2)^2 & \text{if } 1/2 \le \xi_k^{(v-1)} < 1 \\ (1 - \beta/4)/\xi_k^{(v-1)} & \text{if } \xi_k^{(v-1)} \ge 1 \end{cases}$$


for $\beta$ arbitrarily chosen such that $0 < \beta < 1$. Also


$$\xi_k^{(v-1)} = \begin{cases} (g_k^{(v-1)} - 1)/(L' - 1) & \text{if } g_k^{(v-1)} < 1 \\ (g_k^{(v-1)} - 1)/(U' - 1) & \text{otherwise} \end{cases}$$


where $L' = \alpha L + 1 - \alpha$ and $U' = \alpha U + 1 - \alpha$ for $\alpha$ arbitrarily
chosen such that $0 < \alpha < 1$ and $L$ and $U$ are as in earlier
restricted distance functions. The parameters $\alpha$ and $\beta$ serve to
speed up the convergence of the iterative algorithm used to
provide a solution. Singh and Mohl (1996) empirically test a
variety of values for these parameters using large data sets,
and suggest that $\alpha = .67$ and $\beta = .8$ work well in practice.
Finally, the g-factor at each iteration is
$g_k^{(v)} =$


-1 
1+(X- es 2) [Daa PM =) Xe eons 
jes 


where X0"? =Y,.,we” xx; v = 2, 3,... and where wi" = 
a,g{ >; v =2,3,.... Starting values are given by g = 1 
and w. =a,. 

Singh and Mohl (1996) also propose a new distance 
function which changes from iteration to iteration called the 
Shrinkage-Minimization (SM) distance function, and show 
that the estimator resulting from this distance function is also 
asymptotically equivalent to the regression estimator. It is 
given by: 


(We a (v)* Viste 1) 
F'(w, a,) =F, (Wy 5 A) 


= (We sia wala, abe | eV Ie (Oai 1) 
where 
|e Pea Week a. 
-1)* 1 ” 
ae _ U'a, if Wek DST a, WS, Bi on. 


= 
w, ” otherwise. 


Terms in the above equations are defined as follows:
$L' = \alpha L + (1 - \alpha)$, $U' = \alpha U + (1 - \alpha)$, $L'' = \eta L + (1 - \eta)$ and
$U'' = \eta U + (1 - \eta)$ for $\alpha$ and $\eta$ arbitrarily chosen such that
$0 < \alpha < \eta < 1$. As before, the parameters $\alpha$ and $\eta$ serve to
speed up the convergence of the iterative algorithm used to
provide a solution; Singh and Mohl (1996) suggest that
$\alpha = .67$ and $\eta = .9$ work well in practice. Finally,


=1/ il 
we ase ). y =2,3,.. where 


si 2)*
1
Jes
and where X‘’ is as before. Starting values are given by
a*=a, and w® =a, 

A property of the Modified Huang-Fuller and Shrinkage- 
Minimization distance functions is that the calibration 
constraints (equation (2.2)) are met at every iteration whereas 
the range restrictions on the g-factors are met only upon 
convergence. For the Restricted GLS and Restricted Raking 
Ratio distance functions, the range restrictions on the 
g-factors are met at every iteration whereas the calibration 
constraints are only met upon convergence. Now, it is often 
useful to specify an upper bound on the number of iterations 
to convergence; this feature may be programmed into the 
iterative algorithm for operational expediency. If this upper 
bound is exceeded due to slow convergence, the iterative 
algorithm may be terminated prematurely. Regardless, for the 
Modified Huang-Fuller and Shrinkage-Minimization distance 
functions, the calibration constraints will be met. Likewise,
for the Restricted GLS and Restricted Raking Ratio distance
functions, the range restrictions will be met.

Now, the behaviour of the g-factors from some of the 
distance functions has been studied extensively; see, for 
example, Deville, Sarndal and Sautory (1993). Stukel and 
Boyer (1992) empirically show that the GLS and RR distance 
functions, as well as their restricted counterparts having loose 
bounds imposed on them, give g-factors whose distributions 
over a given data set adhere to normality rather closely. 
However, as the bounds on the restricted distance functions 
are squeezed together more closely, the distributions exhibit 
a “pile-up” of g-factors at the lower and upper bounds. 
Regardless, even under extreme squeezing, the restricted 
distance functions seem to give point estimates that are close 
to their unrestricted counterparts, as the results of our em- 
pirical study will verify. However, the biases of both the point 
and variance estimators under extreme squeezing on the 
restricted distance functions have not been investigated. This 
investigation is of interest to surveys such as the LFS, where 
an augmentation to the current estimation system has been 
implemented, which now allows users the option of choosing 
from amongst the Restricted GLS distance function and the 
Shrinkage-Minimization distance function, in addition to the 
previously available GLS distance function. 


3. VARIANCE ESTIMATION FOR CALIBRATION 
ESTIMATORS 


The exact variance of the calibration estimator $\hat{Y}_w$ is
intractable since the point estimator itself is nonlinear. In 
addition, there is no explicit unbiased method of variance 
estimation. Therefore, approximately unbiased methods, such 
as the Taylor and the Jackknife, are often used in practice. 

Now, for stratified multistage designs, “with replacement” 
sampling is not often used in practice since the possibility of 
drawing the same unit more than once is unappealing. There- 
fore, the preponderance of surveys use “without replacement” 
sampling, at least at the first stage of sampling. Even so, if the 
first stage sampling fraction is small (say, less than 10 percent 
as a rule of thumb), it may be reasonable to use a simplified 
variance formula that assumes “with replacement” sampling 
at the first stage of sampling. For the generalized regression 
estimator (GLS distance function) under a stratified multi- 
stage design this simplification of the variance estimator 
yields: 


$$v_T(\hat{Y}_{w(GREG)}) = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left( \sum_{k \in s_{hi}} a_{hik} e_{hik} - \frac{1}{n_h} \sum_{i=1}^{n_h} \sum_{k \in s_{hi}} a_{hik} e_{hik} \right)^2 \tag{3.1}$$


where $s_{hi}$ is the sample of individuals in the $i$-th primary
sampling unit (PSU) and the $h$-th stratum, $a_{hik}$ is the original
sampling weight under the stratified multi-stage design for
sampled individual $k$ in PSU $i$ and stratum $h$, and $n_h$ is
the number of sampled PSUs in stratum $h$. Also
$e_{hik} = y_{hik} - x_{hik}' \hat{B}$ is the estimated residual associated with the
regression estimator where $\hat{B} = (\sum_{hik \in s} a_{hik} x_{hik} x_{hik}' / c_{hik})^{-1}
\sum_{hik \in s} a_{hik} x_{hik} y_{hik} / c_{hik}$. For many designs, the “with
replacement” formula given by (3.1) overestimates the true
variance (see Sarndal, Swensson and Wretman 1992, section
4.6). Note that although, technically speaking, this simplified
variance estimator is not the Taylor variance estimator, it is
often referred to as such for historical reasons and so will it
be in this paper.
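
A small sketch may help to make the structure of (3.1) concrete: within each stratum, the weighted residuals are summed to PSU level, and the between-PSU variability of those sums is accumulated. The function names and the toy records below are assumptions for illustration only; passing the final weights in place of the sampling weights would give the g-weighted variant.

```python
def psu_totals(units):
    """units: iterable of (stratum, psu, weight, residual) records.

    Returns {stratum: [weighted residual total of each sampled PSU]}.
    """
    totals = {}
    for stratum, psu, weight, resid in units:
        by_psu = totals.setdefault(stratum, {})
        by_psu[psu] = by_psu.get(psu, 0.0) + weight * resid
    return {h: list(by_psu.values()) for h, by_psu in totals.items()}


def wr_variance(totals_by_stratum):
    """With-replacement variance estimator of the form (3.1)."""
    v = 0.0
    for totals in totals_by_stratum.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two sampled PSUs")
        mean = sum(totals) / n_h
        v += n_h / (n_h - 1) * sum((z - mean) ** 2 for z in totals)
    return v


# Hypothetical records: (stratum, PSU, weight, residual).
records = [
    ("A", 1, 1.0, 1.0), ("A", 1, 1.0, 2.0), ("A", 2, 1.0, -1.0),
    ("B", 1, 1.0, 2.0), ("B", 2, 1.0, 0.0), ("B", 3, 1.0, -2.0),
]
variance = wr_variance(psu_totals(records))
```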

An improvement to equation (3.1), which includes the
g-factor in the variance formula (recall that $w_{hik} = a_{hik} g_{hik}$), is
suggested by Hidiroglou, Fuller and Hickman (1980). It is
given by:


$$v_T(\hat{Y}_{w(GREG)}) = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left( \sum_{k \in s_{hi}} w_{hik} e_{hik} - \frac{1}{n_h} \sum_{i=1}^{n_h} \sum_{k \in s_{hi}} w_{hik} e_{hik} \right)^2 \tag{3.2}$$


An analogue of equation (3.2) is also suggested by Sarndal
(1982) in the context of two-stage sampling, but for Yates-
Grundy type variance estimators. Now Deville and Sarndal
(1992) show that any distance function which obeys a set of
general conditions will produce an estimator that is asymptotically
equivalent to the one produced by the GLS distance
function, that is, $\hat{Y}_{w(GREG)}$ given by (2.5). Singh and Mohl
(1996) extend this result to include the Modified Huang-
Fuller and Shrinkage-Minimization distance functions. As a
result, the asymptotic variance of the calibration estimator $\hat{Y}_w$
can be considered to be roughly equal to that of $\hat{Y}_{w(GREG)}$.
This observation leads to a method for estimating the Taylor
variance which is common to all calibration estimators,
namely, to estimate the variance of $\hat{Y}_w$ using a modification
of the Taylor variance estimator employed for $\hat{Y}_{w(GREG)}$,
rather than rederiving the Taylor formula for each of the
distance functions separately. Thus, whenever a variance
estimator associated with a distance function different from
the GLS is required, equation (3.2) is used, replacing the final
weights $\{w_{hik}\}$ from the GLS distance function with those
from the distance function in question.

It is straightforward to apply the Jackknife procedure to
obtain a variance estimator for $\hat{Y}_w$ regardless of the distance
function used to obtain the final calibrated weights. An
expression for the variance formula under a stratified multi-
stage design using with replacement sampling at the first stage
is given by:


$$v_J(\hat{Y}_w) = \sum_{h=1}^{L} \frac{n_h - 1}{n_h} \sum_{i=1}^{n_h} (\hat{Y}_{w(hi)} - \hat{Y}_w)^2 \tag{3.3}$$


where $\hat{Y}_{w(hi)}$ is often referred to as the “replicate estimator”;
“replicates” are formed by taking what remains of the sample
after removing PSU $i$ from stratum $h$. Thus, $\hat{Y}_{w(hi)}$ is
calculated by recomputing $\hat{Y}_w$ after removing the $i$-th PSU
from the $h$-th stratum, $h = 1, \ldots, L$; $i = 1, \ldots, n_h$, i.e., with the
original sampling weights altered to reflect the PSU removal
and the g-factors recalculated based on the reduced sample or
replicate. Finally, the Jackknife estimator is constructed by
repeatedly removing PSUs one at a time, calculating the
corresponding replicate estimator, and then assembling the
final estimator using (3.3). The Jackknife variance estimator
given by (3.3) is the most conservative among the four variations
suggested in the extensive discussion on the subject by
Wolter (1985).
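
The delete-one-PSU construction can be sketched as below. The sketch assumes that removing a PSU is reflected by inflating the weights of the remaining PSUs in the same stratum by $n_h/(n_h - 1)$, and it takes the point estimator as a user-supplied function so that any recalibration of g-factors happens inside that function. The names `jackknife_variance` and `weighted_total` and the toy records are hypothetical.

```python
def jackknife_variance(units, estimator):
    """Delete-one-PSU Jackknife of the form (3.3).

    units: list of (stratum, psu, weight, data) records.
    estimator: function taking a list of (weight, data) pairs and
    returning the point estimate (recalibration, if any, goes here).
    """
    full = estimator([(w, d) for _, _, w, d in units])
    psus = {}
    for stratum, psu, _, _ in units:
        psus.setdefault(stratum, set()).add(psu)
    v = 0.0
    for h, ids in psus.items():
        n_h = len(ids)
        for i in ids:
            replicate = []
            for stratum, psu, w, d in units:
                if stratum == h and psu == i:
                    continue  # drop the removed PSU
                # Reweight the PSUs left in the affected stratum.
                scale = n_h / (n_h - 1) if stratum == h else 1.0
                replicate.append((w * scale, d))
            v += (n_h - 1) / n_h * (estimator(replicate) - full) ** 2
    return v


def weighted_total(rows):
    """Simplest possible estimator: a weighted total with no calibration."""
    return sum(w * y for w, y in rows)


# Hypothetical records: (stratum, PSU, weight, y).
records = [
    ("A", 1, 1.0, 1.0), ("A", 1, 1.0, 2.0), ("A", 2, 1.0, -1.0),
    ("B", 1, 1.0, 2.0), ("B", 2, 1.0, 0.0), ("B", 3, 1.0, -2.0),
]
v_jack = jackknife_variance(records, weighted_total)
```

For a linear estimator such as `weighted_total`, the Jackknife reduces algebraically to the with-replacement formula applied to PSU totals; the two differ only for nonlinear estimators such as calibrated ones.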

It is interesting to note that, for the GREG estimator, Yung 
and Rao (1996) obtain (3.2) as an approximation to the 
Jackknife variance estimator given by (3.3); they call (3.2) the 
“Jackknife Linearization Variance Estimator”. Their simul- 
ation study shows that biases (both conditional and uncon- 
ditional) of the Taylor variance estimator (equation (3.1)), the 
Jackknife Linearization variance estimator (equation (3.2)) 
and the Jackknife variance estimator (equation (3.3)) behave 
similarly. While their simulation focuses on variance esti- 
mators for the unrestricted GREG estimator, our simulation 
study, which we discuss next, focuses on variance estimators 
for the GREG as well as for estimators based on other 
restricted and unrestricted distance functions. 


4. MONTE CARLO SIMULATION STUDY 


4.1 Design of the Study 


In order to compare the performance of the calibration 
estimators and their corresponding Taylor and Jackknife 
variance estimators, we undertook a Monte Carlo simulation 
study, in which we investigated their finite sample design- 
based frequentist properties. 

December 1990 Labour Force Survey (LFS) sample data 
for the province of Newfoundland was used to simulate a 
finite population, from which repeated samples were drawn. 
The LFS is the largest ongoing household sample survey 
conducted by Statistics Canada. Monthly data relating to the 
labour market is collected using a complex multi-stage 
sampling design with several levels of stratification. The 
details of the design of the survey prior to the 1991 redesign 
can be found in Singh, Drew, Gambino and Mayda (1990). 
In general, provinces are stratified into “economic regions”, 
which are large areas of similar economic structure; New- 
foundland has four such economic regions. The economic 
regions are further substratified into “self-representing units” 
(SRUs) and “non self-representing units” (NSRUs), which 
are, in turn, further substratified into lower level substrata. 
SRUs are cities whose population exceeds 15,000, such as
St. John’s and Corner Brook, in the case of Newfoundland.
Now, the lowest level of stratification in Newfoundland
yielded 45 strata, each of which contained fewer than 6 primary
sampling units (PSUs), which was an insufficient number
from which to sample, for the purposes of the simulation.
Thus, the 45 strata were collapsed down to 18, each
containing between 6 and 18 PSUs. In collapsing the strata,
economic regions were kept intact, as were the Census
Metropolitan Areas (CMAs) of St. John’s and Corner Brook.

For the Monte Carlo study, R = 4,000 samples, each of size 
approximately 1,000, were drawn from the Newfoundland 
“population” (which was of size 9,152), according to a two- 
stage design. For collapsed strata belonging to NSRUs, two 
PSUs were selected at the first stage using Probability 
Proportional to Size (PPS) with replacement (WR) sampling, 
where the size measure used was the number of dwellings in 
the PSU. At the second stage, one in five dwellings were 
selected from the sampled PSUs using Simple Random 
Sampling (SRS) without replacement (WOR). For collapsed 
strata belonging to SRUs, three PSUs were selected at the 
first stage using PPS WR sampling. At the second stage, all 
the dwellings in the sampled PSUs were selected, reducing 
this part of the design to one-stage take-all cluster sampling. 
This feature was necessary since there were not enough 
dwellings per PSU to subsample in SRUs. The selection of 
two PSUs in NSRU strata versus three in SRU strata was 
driven by the fact that, in general, NSRU strata had fewer 
population PSUs from which to sample than did SRU 
strata. In all, there were 47 sampled PSUs. In either case 
(NSRUs or SRUs), all dwelling members were included in the 
sample. Although this design is a hybrid between a one and 
two-stage design, we shall refer to it as a two-stage design, for 
convenience. 
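
One replicate draw within a single stratum might be sketched as follows. The sketch assumes that dwelling counts per PSU are available and that weights are the usual with-replacement expansion weights, $1/(n p_i)$ at the first stage times the inverse subsampling fraction at the second stage; the paper does not spell out its weight formula, so that part, along with the function name `draw_stratum_sample` and the toy sizes, is an assumption.

```python
import random


def draw_stratum_sample(psu_sizes, n_psu, rate, rng):
    """PPS with-replacement PSU selection, then SRS of dwellings.

    psu_sizes: {psu id: number of dwellings}; rate: second-stage
    sampling fraction (1.0 means take-all, as in the SRU strata).
    Returns a list of (psu, sampled dwelling indices, dwelling weight).
    """
    ids = list(psu_sizes)
    sizes = [psu_sizes[i] for i in ids]
    total = sum(sizes)
    draws = rng.choices(ids, weights=sizes, k=n_psu)  # PPS, with replacement
    sample = []
    for psu in draws:
        m = psu_sizes[psu]
        take = m if rate >= 1.0 else max(1, round(m * rate))
        dwellings = rng.sample(range(m), take)        # SRS without replacement
        p_draw = m / total                            # single-draw probability
        weight = (1.0 / (n_psu * p_draw)) * (m / take)
        sample.append((psu, dwellings, weight))
    return sample


# Hypothetical NSRU stratum: two PSUs, one dwelling in five.
rng = random.Random(12345)
sizes = {"p1": 50, "p2": 100, "p3": 150, "p4": 200}
sample = draw_stratum_sample(sizes, n_psu=2, rate=0.2, rng=rng)
```

Because the size measure is the dwelling count itself, the weighted number of sampled dwellings reproduces the stratum total exactly under these assumed weights.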

We took $Y$, the total number of unemployed, to be the
parameter of interest. This was calculated from the finite
population by: $Y = \sum_{k \in U} y_k$, where $y_k = 1$ if individual
$k$ was unemployed; 0 otherwise. For each of the $R = 4,000$
samples, we calculated $\hat{Y}_w$, the estimated total number of
unemployed, as $\hat{Y}_w = \sum_{k \in s} w_k y_k$. The $\{w_k : k \in s\}$ were
determined by the following six distance functions discussed
earlier:

(1) the Generalized Least Squares (GLS) Distance Function 
(equation (2.4)), 

(2) the Raking Ratio (RR) Distance Function (equation
(2.7)),

(3) the Restricted GLS (RGLS) Distance Function (equation 
(2.8)), 

(4) the Restricted RR (RRR) or Logit Distance Function 
(equation (2.9)), 

(5) the Modified Huang-Fuller (MHF) Distance Function 
(a = .67, 6 = .8) (equation (2.10)), and 

(6) the Shrinkage-Minimization (SM) Distance Function 
(a = .67, 7 = .9) (equation (2.11)). 

For the latter four distance functions, the following four
sets of bounds were imposed on each to restrict the
minimization: (i) L = 0, U = 4, (ii) L = .4, U = 2, (iii) L = .68,
U = 1.6 and (iv) L = .8, U = 1.3. This yielded a total of
eighteen point estimators (the two unrestricted distance
functions plus the four restricted ones under each of the four
sets of bounds). For each of the eighteen point
estimators, the calibration used auxiliary information based
on Census projections at the province level for 10 mutually
exclusive and exhaustive age/sex categories (age categories:
≤ 14, 15-24, 25-44, 45-64, ≥ 65 crossed with the two
sexes) and the four economic regions of Newfoundland.


Thus, the auxiliary information for each individual was a
vector of length fourteen having exactly two ones and twelve
zeros. However, for computational purposes, the dimensionality
of the vector had to be reduced to thirteen when
using the Newton-Raphson procedure to solve equation (2.3).
For the first four distance functions, we set $c_k = 1$.
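
The coding of the auxiliary vector can be sketched as follows. This is only an illustration: the category labels, the function name `aux_vector`, and the choice of which indicator to drop are assumptions, not details given in the study. Because the ten age/sex indicators sum to one for every individual and so do the four region indicators, the fourteen columns are linearly dependent; dropping one indicator leaves thirteen.

```python
# Hypothetical labels; the study only states 5 age groups x 2 sexes and 4 regions.
AGE_GROUPS = ["<=14", "15-24", "25-44", "45-64", ">=65"]
SEXES = ["M", "F"]
REGIONS = [1, 2, 3, 4]

def aux_vector(age_group, sex, region, drop_last_region=False):
    """Return the 0/1 auxiliary vector for one individual."""
    age_sex = [1 if (a == age_group and s == sex) else 0
               for a in AGE_GROUPS for s in SEXES]
    reg = [1 if r == region else 0 for r in REGIONS]
    if drop_last_region:
        # Assumed way of removing the redundancy: omit one region indicator.
        reg = reg[:-1]
    return age_sex + reg

x = aux_vector("25-44", "F", 3)                                  # length 14
x_reduced = aux_vector("25-44", "F", 3, drop_last_region=True)   # length 13
```

In the full-length vector the first ten entries and the last four entries each sum to one, which is the redundancy that makes the calibration system singular unless a column is removed.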

For each of the R = 4,000 samples and each of the eighteen 
point estimators, we calculated the Jackknife variance esti- 
mator given by equation (3.3). We also calculated the Taylor 
variance estimator given by equation (3.2), and the modifica- 
tion suggested in section 3 was used for distance functions 
other than the GLS. Note that since PPSWR, rather than 
PPSWOR, was used at the first stage of sampling, the use of 
the variance estimator given by equation (3.2) was entirely 
appropriate for our simulation. Finally, for the GLS distance 
function only, the formula (3.1) was calculated to observe the 
impact of omitting g-factors from the variance estimator. 

For each of the six distance functions given above, a 
number of frequentist properties were investigated. These are 
given below. 


(A) The Percent Relative Bias of the Estimated Number of
Unemployed (with respect to the population value) is
estimated by:

$$100 \times \frac{E_M(\hat{Y}) - Y}{Y} \qquad (4.1)$$

where

$$E_M(\hat{Y}) = \frac{1}{R} \sum_{r=1}^{R} \hat{Y}_r$$

is the Monte Carlo expectation of the point estimator $\hat{Y}$
taken over the R samples, and $\hat{Y}_r$ is the value of $\hat{Y}$ for
sample r.

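(B) The Percent Relative Bias of the Taylor/Jackknife
Variance Estimator (with respect to the true variance) is
estimated by:

$$100 \times \frac{E_M(\hat{V}(\hat{Y})) - V_{\text{true}}}{V_{\text{true}}} \qquad (4.2)$$

where

$$E_M(\hat{V}(\hat{Y})) = \frac{1}{R} \sum_{r=1}^{R} \hat{V}_r(\hat{Y}),$$

$V_{\text{true}}$ is the true variance of $\hat{Y}$, evaluated as its Monte Carlo
variance over the R samples,
and $\hat{V}_r(\hat{Y})$ is the value of $\hat{V}(\hat{Y})$ (Taylor or Jackknife) for
sample r.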

(C) The Percent Coefficient of Variation of the Taylor/
Jackknife Variance Estimator (with respect to the true
variance) is estimated by:

$$100 \times \frac{\sqrt{\frac{1}{R} \sum_{r=1}^{R} \left( \hat{V}_r(\hat{Y}) - V_{\text{true}} \right)^2}}{V_{\text{true}}} \qquad (4.3)$$

i.e., the root mean squared error of the variance estimator
divided by the true variance, expressed as a percentage.
Although most studies focus on the bias of the variance
estimators, it is also of secondary interest to look at the
coefficient of variation of the variance estimators to see how
variable the variance estimates themselves are.

Note that in equations (4.2) and (4.3), it may have been 
more appropriate to make comparisons relative to a “true 
mean squared error” rather than a “true variance”. However, 
for our simulation, the relative biases were so small that the 
differences between the two types of comparisons are vir- 
tually negligible. 

Finally, in order to assess the appropriateness of the choice
of number of repeated samples, we calculated Monte Carlo
errors, using as a measure the Percent Coefficient of Variation
of $E_M(\hat{V}(\hat{Y}))$, given by:

$$100 \times \frac{\sqrt{\frac{1}{R^2} \sum_{r=1}^{R} \left( \hat{V}_r(\hat{Y}) - E_M(\hat{V}(\hat{Y})) \right)^2}}{E_M(\hat{V}(\hat{Y}))} \qquad (4.4)$$

The Monte Carlo errors were found to be consistently low
(between .99% and 3.60%) for both the Jackknife and Taylor
using R = 4,000, indicating stable results.
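
The Monte Carlo summaries described in (A) to (C), together with the Monte Carlo error, can be computed along the following lines. This is a minimal sketch rather than the code used in the study: the function name and argument names are hypothetical, and the divisors (R for the Monte Carlo variance, R squared for the Monte Carlo error) are assumptions; with R = 4,000 the choice between R and R - 1 makes no practical difference.

```python
from math import sqrt

def monte_carlo_measures(y_hats, v_hats, y_true):
    """Summarize R simulated point estimates and variance estimates.

    y_hats: point estimates, one per sample
    v_hats: variance estimates (Taylor or Jackknife), one per sample
    y_true: finite-population value of the parameter
    """
    R = len(y_hats)
    e_y = sum(y_hats) / R                               # Monte Carlo expectation
    v_true = sum((y - e_y) ** 2 for y in y_hats) / R    # assumed divisor R
    e_v = sum(v_hats) / R
    return {
        # (A) percent relative bias of the point estimator
        "rel_bias_point": 100 * (e_y - y_true) / y_true,
        # (B) percent relative bias of the variance estimator
        "rel_bias_var": 100 * (e_v - v_true) / v_true,
        # (C) root mean squared error of the variance estimator over true variance
        "cv_var": 100 * sqrt(sum((v - v_true) ** 2 for v in v_hats) / R) / v_true,
        # Monte Carlo error: percent CV of the mean of the variance estimates
        "mc_error": 100 * sqrt(sum((v - e_v) ** 2 for v in v_hats)) / R / e_v,
    }
```

A call such as `monte_carlo_measures([90, 110, 100, 100], [40, 60, 50, 50], 100)` returns the four percentages for a toy set of four samples.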
4.2 Results of the Study


Table 1 gives the Percent Relative Bias of the Point Esti- 
mators (equation (4.1)) as well as the Percent Relative Bias of 
the Taylor and Jackknife Variance Estimators (equation (4.2)) 
and the Percent CVs of the Taylor and Jackknife Variance 
Estimators (equation (4.3)). The percent relative bias for all
the point estimates (column two) is negligible, ranging in
value from 0.10% to 0.52%, well below 1% in all
cases. The fact that all point estimates have a similar bias
seems reasonable, given the asymptotic equivalence of all 
calibration estimators to the regression estimator. 

The third column gives the percent relative bias of the 
Taylor variance estimator. Here, the true variance is always 
underestimated, but never by more than 6.2%. In the case of 
the regression estimator, it appears to make little difference 
whether or not the g-factor is included in the variance formula 
(equation (3.1) versus (3.2)); the bias improves only slightly
when the g-factor is included (-5.82% versus -6.01%).
The Jackknife variance estimator (column four), on the other 
hand, outperforms the Taylor variance estimator uniformly. 
The Jackknife almost always underestimates the true variance, 
but by less than 2% in all cases. 

Table 1
Percent Relative Bias of the Point Estimators, and Percent Relative Bias and Percent CV of the Taylor and
Jackknife Variance Estimators (Sample Size About 1000)

                               Percent        Percent           Percent                                             Number of
                               Relative       Relative          Relative         Percent CV        Percent CV       Discarded
Distance Function              Bias Point     Bias Taylor       Bias Jackknife   Taylor            Jackknife        Samples
                               Estimator      Variance          Variance         Variance          Variance         (From 4000)

GLS (Regression)               [illegible]    -6.01 (eq 3.1)    -1.73            60.79 (eq 3.1)    62.86            0
                                              -5.82 (eq 3.2)                     59.60 (eq 3.2)
Restricted GLS (L=0, U=4)      [illegible]    [illegible]       -1.73            59.60             62.86            0
               (L=.4, U=2)     .10            -5.36             -1.27            59.93             63.21            32
Raking Ratio                   [illegible]    -6.20             0.84             59.45             63.35            0
Restricted RR  (L=0, U=4)      .50            -6.09             -0.31            59.48             63.47            0
               (L=.4, U=2)     .46            -5.69             -0.39            59.81             64.21            32
Modified       (L=0, U=4)      [illegible]    -5.82             -1.73            59.60             62.86            0
Huang-Fuller   (L=.4, U=2)     .10            -5.30             -1.20            59.94             63.27            32
Shrinkage-     (L=0, U=4)      [illegible]    -5.82             -1.73            59.60             62.86            0
Minimization   (L=.4, U=2)     .10            -5.36             -1.27            59.94             63.25            32

To produce a solution, all distance functions but the GLS
required an iterative algorithm. This being the case, some of
the 4,000 samples experienced convergence problems,
particularly in the case of extreme bounding on the g-factors.
Those samples for which the algorithm did not converge were
discarded. Thus, they did not contribute to the various Monte
Carlo measures. The number of such discarded samples is
indicated in the last column of Table 1. In the case of extreme
bounds (L = .68, U = 1.6 and L = .8, U = 1.3), so many 
samples were discarded (between 231 and 234 for the cases 
L = .68, U = 1.6 and between 1,562 and 1,602 for the cases 
L=.8, U = 1.3) that the results were not considered reliable, 
and so are not reported here. However, these tighter bounds 
were of interest, so the simulation was rerun using 
approximately double the sample size (increase from roughly 
1,000 to 2,000). Note that Deville and Sarndal (1992) show 
that convergence is achieved for all distance functions with 
probability one as the sample size increases. 

Columns five and six of Table 1 give the Percent CVs of
the Taylor and Jackknife Variance Estimators. The coefficients
of variation are similar for all distance functions,
ranging in value from 59.45% to 64.21%. However, the CVs
corresponding to the Jackknife are always slightly larger than
those of Taylor. Coefficients of variation of this magnitude,
although large, have been encountered in other simulation
studies relating to variances. See, for example, Kovacevic,
Yung and Pandher (1995). However, we were interested in
seeing if the key results relating to the bias of the variance
estimators would still hold if the CVs were lowered.
Therefore, at the suggestion of a referee, we reran the
simulation, increasing the number of PSUs drawn from 47 to 83,
since CVs of variance estimators are known to be approximately
inversely related to the number of PSUs drawn. The
PSUs were increased in such a way that the overall design
was made self-weighting; this approach appeared to have the
greatest effect on lowering the CVs. The second stage of
sampling remained the same as before. Rerunning the
simulation had the secondary benefit of roughly doubling the
sample size, thus solving the convergence problems
referred to in the last paragraph.

The results from the second run of the simulation are
reported in Table 2. The last column in Table 2 shows the
reduced number of discarded samples due to convergence
problems. The fifth and sixth columns of this table show that
the CVs are substantially reduced, to between 22.70% and
24.20%, with the Jackknife consistently exhibiting slightly
higher values. As before, the percent relative bias in the
point estimator is negligible, always being well under 1%. In
the previous run, the percent relative biases for the Taylor
estimator were always roughly -6%; here, they are always
about -3%, again implying underestimation of the true
variance. Once more, in the case of the GLS distance function,
there is very little difference in the bias that results from using
equation (3.1) versus (3.2). The percent relative bias in the
Jackknife estimator (always roughly -1.5%) is consistently
smaller in absolute value than that of Taylor. For the Jackknife
estimator, there is one case (Restricted RR (L = .8,
U = 1.3)) where there were convergence problems; those
results are omitted, indicated by a “*”. Surprisingly, for both
the Taylor and Jackknife, there is virtually no change in bias
for the restricted distance functions as the bounds are made
successively tighter. In fact, there seems to be very little
difference in the percent relative bias across all of the distance
functions, for both the Taylor and the Jackknife. Note that for
the rerun of the simulation, the Monte Carlo errors ranged
between .37% and 2.13%.

Table 2
Percent Relative Bias of the Point Estimators, and Percent Relative Bias and Percent CV of the Taylor and
Jackknife Variance Estimators (Sample Size About 2000)

                                  Percent        Percent                 Percent                                             Number of
                                  Relative       Relative                Relative         Percent CV        Percent CV       Discarded
Distance Function                 Bias Point     Bias Taylor             Bias Jackknife   Taylor            Jackknife        Samples
                                  Estimator      Variance                Variance         Variance          Variance         (From 4000)

GLS (Regression)                  .02            [illegible] (eq 3.1)    -1.43            23.03 (eq 3.1)    23.29            0
                                                 -2.61 (eq 3.2)                           22.84 (eq 3.2)
Restricted GLS (L=0, U=4)         .02            [illegible]             -1.43            22.84             23.29            0
               (L=.4, U=2)        .02            [illegible]             -1.43            22.84             23.29            0
               (L=.68, U=1.6)     .02            -2.61                   -1.44            22.84             23.29            0
               (L=.8, U=1.3)      .02            [illegible]             -1.56            22.70             23.15            118
Raking Ratio                      [illegible]    [illegible]             -1.15            22.84             23.43            0
Restricted RR  (L=0, U=4)         .17            [illegible]             -1.36            22.84             23.30            0
               (L=.4, U=2)        .16            -2.70                   [illegible]      22.84             23.29            0
               (L=.68, U=1.6)     [illegible]    [illegible]             -0.49            22.83             24.20            0
               (L=.8, U=1.3)      [illegible]    -2.91                   *                22.70             *                118
Modified       (L=0, U=4)         .02            [illegible]             -1.43            22.84             23.29            0
Huang-Fuller   (L=.4, U=2)        .02            [illegible]             -1.43            22.84             23.29            0
               (L=.68, U=1.6)     .02            [illegible]             -1.44            22.84             23.29            0
               (L=.8, U=1.3)      .02            -2.58                   -1.36            22.73             23.18            116
Shrinkage-     (L=0, U=4)         .02            [illegible]             -1.43            22.84             23.29            0
Minimization   (L=.4, U=2)        .02            -2.61                   -1.43            22.84             23.29            0
               (L=.68, U=1.6)     .02            -2.61                   -1.44            22.84             23.29            0
               (L=.8, U=1.3)      .02            [illegible]             -1.24            22.73             23.63            118


5. CONCLUSIONS 


This paper focused on exploring the behaviour of point 
estimators and their corresponding Taylor and Jackknife 
variance estimators for a number of different distance 
functions available through calibration theory. Particular 
emphasis was given to those distance functions which 
allowed range restrictions to be imposed on the g-factors, 
eliminating the possibility of negative and high positive final 
weights. All of the point estimators which were investigated 
exhibited a negligible bias. 

Both the Jackknife and Taylor variance estimators 
exhibited small underestimation of the true variance, although 
the Jackknife consistently had smaller biases (in absolute 
value) than the Taylor. The most striking result was that, for 
both Taylor and Jackknife, the biases remained roughly the 
same in the cases of extreme bounding on the g-factors as in 
the cases of less restrictive bounding. In general, however, 
caution should be exercised in the use of extreme bounds, due 
to the convergence problems that may be experienced, 
particularly when Jackknifing is used for variance estimation 
and the point estimators must be recalculated repeatedly. If 
the main objective of using the restricted distance functions 
is to eliminate the possibility of negative or high positive 
weights, then modest bounds on the g-factors should suffice. 

As a final remark, it is interesting to note that roughly 97%
of the computing time was spent Jackknifing while the
remaining 3% was spent on Taylor linearization. This rather
extreme difference in computation time may give the Taylor
method an edge if measures of precision are
required for a large number of domains. However, given
recent developments in the computational efficiency of the
Jackknife variance estimator (for example, the program
WESVARPC (1995)), it may be possible to offset this
imbalance. Even so, it should be noted that, at this time,
WESVARPC has improved the computational efficiency only for
designs having two PSUs per stratum and for poststratified
estimators having only one dimension.

In conclusion, since our study does not show
either variance estimator to be clearly superior and shows
both to behave reasonably well for all distance functions, it is
up to the user to decide which variance estimator/distance function
combination best fits the system requirements.
An Application of Restricted Regression Estimation
in a Household Survey 


BODHINI R. JAYASURIYA and RICHARD VALLIANT¹


ABSTRACT 


This paper empirically compares three estimation methods — regression, restricted regression, and principal person — used 
in a household survey of consumer expenditures. The three methods are applied to post-stratification which is important 
in many household surveys to adjust for under-coverage of the target population. Post-stratum population counts are 
typically available from an external census for numbers of persons but not for numbers of households. If household 
estimates are needed, a single weight must be assigned to each household while using the person counts for 
post-stratification. This is easily accomplished with regression estimators of totals or means by using person counts in each 
household’s auxiliary data. Restricted regression estimation refines the weights by controlling extremes and can produce 
estimators with lower variance than Horvitz-Thompson estimators while still adhering to the population controls. The 
regression methods also allow controls to be used for both person-level and household-level counts and quantitative 
auxiliaries. With the principal person method, persons are classified into post-strata and person weights are ratio adjusted 
to achieve population control totals. This leads to each person in a household potentially having a different weight. The 
weight associated with the “principal person” is then selected as the household weight. We compare estimated means
from the three methods and their estimated standard errors for a number of expenditures from the Consumer Expenditure
Survey sponsored by the U.S. Bureau of Labor Statistics.


KEY WORDS: Calibration; Principal person method; Replication variance; Restricted regression. 


1. INTRODUCTION 


A signal problem in large household surveys is under-coverage
of the target population, often arising from
differential response rates among population subgroups and
frame deficiencies. Post-stratification is one method used at
the estimation stage to reduce mean square errors, based on
information that affects the response variables. The estimator
is constructed in such a way that the estimated total number of 
individuals falling into each post-stratum is equal to the true 
population count. Post-stratum population counts are typically 
available from an external census for numbers of persons but 
not always for numbers of households. If household estimates 
are needed, a single weight must be assigned to each house- 
hold while using the person counts for post-stratification. 
Regression estimators of totals or means accomplish this by 
using person counts in each household’s auxiliary data. 
Restricted regression estimation controls extreme weights and 
can produce estimators with lower variance than the Horvitz- 
Thompson estimator while still adhering to the population 
controls. An alternative used by some surveys is the Principal
Person (PP) method (Alexander 1987) in which the household
weight is based on the individual designated as the “principal
person” in each household. Persons are classified into
post-strata and person weights are ratio adjusted to achieve 
population control totals, leading to the possibility that each 
person in a household may have a different weight. The 
weight associated with the principal person is then assigned to 
the household. This ad hoc method is difficult to analyze
theoretically. The regression estimators discussed in this
paper, while easily adjusting for the population under-count,
automatically provide a household weight that is not based on
any particular one of its members. Lemaitre and Dufour
(1987) address Statistics Canada’s use of the regression 
estimator in this regard. 

There are a growing number of precedents for the use of 
regression estimators in surveys both in the theoretical 
literature and in actual survey practice. Statistics Canada has 
incorporated the general regression estimator into its 
generalized estimation system (GES) software that is now 
used in many of its surveys (Estevao, Hidiroglou and Sarndal 
1995). Fuller, Loughin and Baker (1993) discuss an 
application to the USDA Nationwide Food Consumption 
Survey. One of the attractions of regression estimation is that 
many of the standard techniques in surveys including the 
post-stratification estimator mentioned above are special cases 
of regression estimators. The regression estimator also more 
flexibly incorporates auxiliary data than other more common 
methods. In a household survey, for example, both person- 
level and household-level auxiliaries that can be qualitative or 
quantitative are easily accommodated. Other works related 
to regression estimation and post-stratification include 
Bethlehem and Keller (1987), Casady and Valliant (1993), 
Deville and Sarndal (1992), Deville, Sarndal and Sautory 
(1993) and Zieschang (1990). 

In this study we compare the regression estimator with the 
PP estimator currently in use at the Bureau of Labor Statistics 
(BLS). Each estimator can be written in the form of a 
weighted sum of the sample values of the response variable. 
Then each weight is traditionally interpreted as the number of 
individuals in the population who would have the corresponding
value of the response variable. This interpretation requires
that each weight be greater than or equal to one. The ordinary 
least-squares regression estimator has the disadvantage that it 
can produce non-positive weights. A number of ways are 
suggested in the literature on how to overcome this problem. 
Possibly the easiest is the method introduced by Deville and 
Sarndal (1992) which can remove any negative weights as 
well as control extreme weights. The restricted regression 
estimators produced by these new weights are also compared 
to the original regression estimator and the PP estimator. 

In Section 2, the three different estimators are presented. 
Section 3 is an application of these procedures to the 
Consumer Expenditure (CE) Survey at BLS — the same setting 
as in Zieschang (1990). We compare the coefficients of 
variation for a number of the survey target variables for the 
full population and for a number of domains. Section 4 
provides a summary of our conclusions. 


2. REGRESSION, CALIBRATION 
AND PRINCIPAL PERSON 
ESTIMATION 


First, we give a brief introduction to the regression
estimator. A sample s of size n is selected from a finite
population U of size N. Let the probability of selection of the
i-th unit be $\pi_i$. The sample could be two-stage and the unit
could be either the primary sampling unit or the secondary
sampling unit. There is no need here to complicate the
notation with explicit subscripts for the different stages of
sampling. Let the variable of interest be denoted by y and
suppose that its value at the i-th unit, $y_i$, is observed for each
$i \in s$. Assume the existence of K auxiliary variables $x_1, x_2, \ldots, x_K$
whose values at each $i \in s$ are available. Define
$\mathbf{x}_i = (x_{i1}, \ldots, x_{iK})'$ for each $i \in U$, where $x_{ik}$ denotes the
value of the variable $x_k$ at unit i. Let $\mathbf{X} = (X_1, \ldots, X_K)'$ denote
the K-dimensional vector of known population totals of the
variables $x_1, x_2, \ldots, x_K$. The regression estimator is then
motivated by the working model $\xi$:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \varepsilon_i \qquad (2.1)$$

for i = 1, ..., N. Here, $\beta_1, \ldots, \beta_K$ are unknown model parameters.
The $\varepsilon_i$ are random errors with $E_\xi(\varepsilon_i) = 0$ and $\mathrm{var}_\xi(\varepsilon_i) = \sigma_i^2$ for
i = 1, ..., N. The term “working model” is used to emphasize
the fact that the model is likely to be wrong to some degree. In
the CE, the unit of analysis, indexed by i, is a consumer unit
(CU), which is similar to a household and defined in more
detail in Section 3. The value $y_i$ might be the total food
expenditures by the CU and the $x_{ik}$’s might be various CU
characteristics like numbers of people of different ages, or CU
income, that have an effect on the CU’s expenditure on food.
The variance of expenditures might be dependent on CU size
so that having $\sigma_i^2$ proportional to the number of persons in the
CU might be reasonable. We include an intercept in some of
our models by setting the first auxiliary variable, $x_{i1}$, equal to 1.


A linear regression estimator of the population total of y is
defined to be

$$\hat{y}_R = \hat{y}_\pi + (\mathbf{X} - \hat{\mathbf{X}}_\pi)' \hat{\mathbf{B}} \qquad (2.2)$$

where $\hat{y}_\pi$ denotes the π-estimator (or Horvitz-Thompson
estimator) of the population total of y, i.e.,

$$\hat{y}_\pi = \sum_{i \in s} a_i y_i \qquad (2.3)$$

with $a_i = 1/\pi_i$. Also, $\hat{\mathbf{X}}_\pi = (\hat{x}_{1\pi}, \ldots, \hat{x}_{K\pi})'$ is the vector of
π-estimators of the population totals of the variables
$x_1, \ldots, x_K$, and

$$\hat{\mathbf{B}} = \left( \sum_{i \in s} \frac{a_i \mathbf{x}_i \mathbf{x}_i'}{\sigma_i^2} \right)^{-1} \sum_{i \in s} \frac{a_i \mathbf{x}_i y_i}{\sigma_i^2}. \qquad (2.4)$$

We assume that $\sum_{i \in s} a_i \mathbf{x}_i \mathbf{x}_i' / \sigma_i^2$ is nonsingular. Even if model
(2.1) fails to some degree, $\hat{y}_R / N$ is a design consistent
estimator of the population mean $\bar{Y}$. This is clear from (2.2): if $\hat{y}_\pi / N$
and $\hat{\mathbf{X}}_\pi / N$ are design consistent estimators of $\bar{Y}$ and of $\bar{\mathbf{X}}$,
the vector of population means of the auxiliaries, then the
second term in $\hat{y}_R / N$ converges to zero while the first
converges to $\bar{Y}$. For more details, see Sarndal, Swensson and
Wretman (1992).

The regression estimator $\hat{y}_R$ can also be expressed as a
weighted sum of the sample $y_i$’s, which is a desirable feature
for survey operations. It is easily seen that (2.2) can be
re-written as $\hat{y}_R = \sum_{i \in s} w_i y_i$ with

$$w_i = a_i \left[ 1 + (\mathbf{X} - \hat{\mathbf{X}}_\pi)' \mathbf{A}^{-1} \frac{\mathbf{x}_i}{\sigma_i^2} \right] \qquad (2.5)$$

where $\mathbf{A} = \sum_{i \in s} a_i \mathbf{x}_i \mathbf{x}_i' / \sigma_i^2$. The weights do depend on the
sample through the $\mathbf{x}_i$’s that are in the sample, but this is also
true of many survey estimators, including the post-stratification
estimator. However, these weights do not depend
on the particular y variable being studied, implying that one
set of $w_i$ weights can be used for all estimates.

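A mean per unit is estimated in the obvious way:
$\bar{y}_R = \hat{y}_R / \hat{N}$ where $\hat{N} = \sum_{i \in s} w_i$. If we estimate the totals of the
auxiliaries $x_k$, then

$$\sum_{i \in s} w_i \mathbf{x}_i = \sum_{i \in s} a_i \left[ 1 + (\mathbf{X} - \hat{\mathbf{X}}_\pi)' \mathbf{A}^{-1} \frac{\mathbf{x}_i}{\sigma_i^2} \right] \mathbf{x}_i = \mathbf{X}, \qquad (2.6)$$

i.e., we reproduce the known population totals. This is also a
characteristic of the post-stratification estimator.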
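
The weights in (2.5) and the calibration property in (2.6) can be checked numerically with a small sketch. The data below are invented for illustration (four units, an intercept plus a person count, equal base weights), and the helper names `solve` and `regression_weights` are hypothetical; only the standard library is used.

```python
def solve(M, b):
    """Solve the linear system M z = b by Gaussian elimination."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def regression_weights(a, x, sigma2, X):
    """Weights w_i = a_i * [1 + (X - X_pi)' A^{-1} x_i / sigma_i^2]."""
    K = len(X)
    A = [[sum(ai * xi[j] * xi[k] / s2 for ai, xi, s2 in zip(a, x, sigma2))
          for k in range(K)] for j in range(K)]
    X_pi = [sum(ai * xi[k] for ai, xi in zip(a, x)) for k in range(K)]
    # A is symmetric, so A^{-1}(X - X_pi) gives the same inner product with x_i.
    lam = solve(A, [X[k] - X_pi[k] for k in range(K)])
    return [ai * (1 + sum(l * v for l, v in zip(lam, xi)) / s2)
            for ai, xi, s2 in zip(a, x, sigma2)]

# Invented example: x_i = (1, number of persons), base weight 10 for each unit.
a = [10.0, 10.0, 10.0, 10.0]
x = [[1, 1], [1, 2], [1, 3], [1, 4]]
sigma2 = [1.0, 1.0, 1.0, 1.0]
X = [45.0, 120.0]   # assumed control totals: 45 units, 120 persons
w = regression_weights(a, x, sigma2, X)
```

Summing `w` and summing `w` times the person counts returns the two control totals, which is the property stated in (2.6).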

The estimator of B in (2.4) does not account for any 
correlation among the errors in model (2.1). In clustered 


Survey Methodology, December 1996 


populations, units that are geographically near each other, e.g.,
CU’s in the same neighborhood, may be correlated. Using a
full covariance matrix $\mathbf{V}$ may be more nearly optimal (e.g., see
Casady and Valliant 1993 and Rao 1994). Though use of a full
covariance matrix $\mathbf{V}$ may lower the variance of the estimator, the
elements of $\mathbf{V}$ will depend on the particular y being studied,
and estimation of $\mathbf{V}$ is generally a nuisance. Consequently, it
is interesting and practical to consider the simple case of
$\mathbf{V} = \mathrm{diag}(\sigma_i^2)$ that leads to (2.2). Note that when the design-
variance $\mathrm{var}_\pi(\hat{y}_R)$ is estimated, it will be necessary to use a
method that properly reflects clustering and other design
complexities.

The regression estimator has the disadvantage that the
weights can be unreasonably large, small, or even negative.
The restricted calibration estimators of Deville and Sarndal
(1992), introduced next, add constraints to control the size of
the weights. Calibration estimators are formed by minimizing
a given distance, F, between some initial weight and the final
weight, subject to constraints. The constraints can involve the
available auxiliary variables, thus incorporating them into
the estimator. The regression estimator presented above is a
special case of the calibration estimator in which F is defined
to be the generalized least squares (GLS) distance function,

$$F(w_i, a_i) = \frac{a_i c_i}{2} \left( \frac{w_i}{a_i} - 1 \right)^2$$

for i = 1, ..., n, with $c_i$ a known, positive weight (e.g., $c_i = \sigma_i^2$
or $c_i = 1$) associated with unit i, and $w_i$, the final weight. The
total sample distance $\sum_{i \in s} F(w_i, a_i)$ is minimized subject to the
constraints,

$$\sum_{i \in s} w_i \mathbf{x}_i = \mathbf{X}. \qquad (2.7)$$

In this form, the weights of the regression estimator of the
population total of y given in (2.5) can be written as,

$$w_i = a_i \, g(c_i^{-1} \mathbf{x}_i' \boldsymbol{\lambda}) \qquad (2.8)$$

for i = 1, ..., n, where

$$g(u) = 1 + u, \qquad (2.9)$$

for $u \in \mathbb{R}$ and $\boldsymbol{\lambda}$ is a Lagrange multiplier evaluated in the
minimization process. The particular form of $w_i$ with $c_i = \sigma_i^2$
for the regression estimator was given in (2.5). To eliminate
extremes, the weights can be refined by restricting g so that

$$g(u) = \begin{cases} L & \text{if } u < L - 1 \\ 1 + u & \text{if } L - 1 \le u \le U - 1 \\ U & \text{if } u > U - 1. \end{cases} \qquad (2.10)$$

With this definition of g, the weights $w_i$ satisfy

$$L \le w_i / a_i \le U \qquad (2.11)$$

for i = 1, ..., n, so that L and U can be chosen in such a way as
to reflect the desired deviation from the initial weights $a_i$.
Choosing L > 0 ensures that the weights are positive, and U is
picked to be appropriately small to prohibit large weights. The
restricted regression weights must be solved for iteratively;
one easily programmed algorithm is given in Stukel and Boyer
(1992). Another method of restricting weights is ridge
regression as used by Bardsley and Chambers (1984).
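
One way to solve for restricted weights iteratively is a Newton-type iteration on the Lagrange multiplier, sketched below. This is an illustrative scheme under stated assumptions, not the Stukel and Boyer (1992) algorithm: the function names, the starting value of zero, the tolerance, and the toy data are all hypothetical. The g-function is the truncated linear form, g(u) = L below L - 1, 1 + u in between, and U above U - 1.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination; fail loudly if M is singular."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise RuntimeError("singular system: too many weights at a bound")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def restricted_weights(a, x, c, X, L, U, tol=1e-9, max_iter=50):
    """Find w_i = a_i * g(x_i' lam / c_i) with sum_i w_i x_i = X and L <= g <= U."""
    K = len(X)
    lam = [0.0] * K
    for _ in range(max_iter):
        u = [sum(l * v for l, v in zip(lam, xi)) / ci for xi, ci in zip(x, c)]
        g = [min(max(1 + ui, L), U) for ui in u]          # truncated linear g
        resid = [X[k] - sum(ai * gi * xi[k] for ai, gi, xi in zip(a, g, x))
                 for k in range(K)]
        if max(abs(r) for r in resid) < tol:
            return [ai * gi for ai, gi in zip(a, g)]
        # Jacobian: only units whose g-factor is strictly between the bounds move.
        H = [[sum(ai * xi[j] * xi[k] / ci
                  for ai, xi, ci, ui in zip(a, x, c, u) if L - 1 < ui < U - 1)
              for k in range(K)] for j in range(K)]
        step = solve(H, resid)
        lam = [l + s for l, s in zip(lam, step)]
    raise RuntimeError("calibration did not converge")

# Invented example: same layout as an intercept plus a person count.
a = [10.0, 10.0, 10.0, 10.0]
x = [[1, 1], [1, 2], [1, 3], [1, 4]]
c = [1.0, 1.0, 1.0, 1.0]
X = [45.0, 120.0]
w_restricted = restricted_weights(a, x, c, X, L=0.95, U=1.4)
```

In this toy case the unrestricted g-factor of the first unit would be 0.9, so the lower bound binds and the remaining units absorb the adjustment. Raising an error on non-convergence mirrors the situation in which tight bounds leave no feasible set of weights.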

In most household surveys, post-stratification serves 
primarily as an adjustment for under-coverage of the target 
population by the frame and the sample. In the U.S., there are 
few reliable population counts of households to use in 
post-stratification. Consequently, population counts of persons 
are usually used for the post-strata control totals. This 
disagreement in the unit of analysis (the household) and the 
unit of post-stratification (the person) when a household 
characteristic is of interest led to the development of the PP 
method that is used in the CE and Current Population Surveys. 

In the PP method described in Alexander (1987), a 
household begins the weighting process with a single base
weight, $a_i$, that is then adjusted for non-response. The
adjusted weight is assigned to each person in the household 
and the person weights are then further adjusted to force them 
to sum to known population controls of persons by age, race, 
and sex. This last adjustment can result in persons having 
different weights within the same household. The household 
is then assigned the weight of the person designated as the 
“principal person” in the household. This method has an 
element of arbitrariness and is difficult to analyze mathe- 
matically. The intent of this research was not to see if the PP 
method could be improved upon, but rather to use the current 
implementation of PP as a convenient baseline for measuring 
the performance of other estimators. 
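
The mechanics of the PP method can be illustrated with a deliberately simplified sketch. It assumes a single ratio adjustment to person controls and takes the non-response-adjusted household weights as given; the actual CE implementation described in Alexander (1987) is more involved. The function name, the post-stratum labels, and all numbers are hypothetical.

```python
from collections import defaultdict

def principal_person_weights(household_weight, persons, controls):
    """persons: list of (household_id, post_stratum, is_principal) tuples."""
    # Weighted person count in each post-stratum before adjustment.
    est = defaultdict(float)
    for hh, stratum, _ in persons:
        est[stratum] += household_weight[hh]
    # Ratio-adjust each person's weight to the post-stratum control total.
    person_w = [household_weight[hh] * controls[stratum] / est[stratum]
                for hh, stratum, _ in persons]
    # The household takes the adjusted weight of its principal person.
    hh_w = {hh: w for (hh, _, principal), w in zip(persons, person_w) if principal}
    return person_w, hh_w

household_weight = {1: 100.0, 2: 100.0}
persons = [(1, "A", True), (1, "B", False), (2, "A", True), (2, "A", False)]
controls = {"A": 360.0, "B": 80.0}
person_w, hh_w = principal_person_weights(household_weight, persons, controls)
```

Here the two members of household 1 end up with different person weights (120 and 80), and the household weight depends on which of them is designated the principal person.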

The regression and restricted regression estimators can be 
formulated in such a way that population person controls are 
satisfied, all persons in a household retain the same weight, 
and no arbitrary choice among person weights is needed to 
assign a household weight. This is accomplished by defining 
the auxiliary variables at the household level. For example, if 
there were three age post-strata and household i has 1, 0, and 
2 persons in these post-strata, the auxiliary data vector would
be $\mathbf{x}_i = (1,0,2)'$. Note that this formulation is different from
Lemaitre and Dufour (1987) who defined the auxiliary 
variables at the person level and assigned the average of the 
household data — (1/3, 0, 2/3) in the example — to each person. 
Those authors used this “average” method because they were 
interested in estimates both for persons, e.g., number 
employed, and for households, e.g., economic families. We, 
on the other hand, need only a household weight since our 
target variables (i.e., y) like shelter or utility expenditures are 
collected at the household level. 
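
The two ways of coding the auxiliary data can be compared in a short sketch. The function names are hypothetical, and post-strata are indexed 0, 1, 2 purely for convenience.

```python
from fractions import Fraction

def household_aux(person_strata, n_strata):
    """Household-level vector: number of members in each post-stratum."""
    counts = [0] * n_strata
    for s in person_strata:
        counts[s] += 1
    return counts

def person_average_aux(person_strata, n_strata):
    """Person-level vector: household counts averaged over household size."""
    size = len(person_strata)
    return [Fraction(c, size) for c in household_aux(person_strata, n_strata)]

members = [0, 2, 2]                             # one person in stratum 0, two in stratum 2
x_household = household_aux(members, 3)         # [1, 0, 2]
x_person_avg = person_average_aux(members, 3)   # [1/3, 0, 2/3], assigned to each person
```

Summing the averaged vector over the three household members reproduces the household-level counts, so the two codings carry the same information about the household.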


3. AN APPLICATION 


We compare the three estimators (i.e., regression, restricted
regression (with L = .5, U = 4), and principal person) by an
application to the estimated means and their estimated
standard errors for a number of expenditures from the CE
Survey sponsored by the Bureau of Labor Statistics.

The CE Survey gathers information on the spending
patterns and living costs of American consumers. There
are two parts to the survey, a quarterly interview and a weekly 
diary survey. The Interview Survey collects detailed data on 
the types of expenditures which respondents can be expected 
to recall for a period of three months or longer (e.g., property, 
automobiles, major appliances) — an estimated sixty to seventy 
percent of total household expenditures. The Diary Survey is 
completed at home by the respondent family for two 
consecutive 1-week periods and collects data on all the 
expenses of the family in that time period. The sample is 
selected in two stages with geographic primary sampling units 
at the first stage and households at the second. 

We evaluated the estimators described above for a number 
of expenditures from the Interview Survey. Data collected 
during the second quarter of 1992 consisting of n = 5156 
CU’s were used. The CE Survey’s primary unit of analysis is 
the consumer unit, an economic family within a household. A 
consumer unit (CU) consists of individuals in the household 
who share expenditures. Thus, there may be more than one 
CU in a household. 

Five different sets of auxiliary variables ($x_k$’s in the
notation of Section 2) were studied. They were chosen by
testing the adequacy of model (2.1) for the selected 
expenditures with different combinations of the available 
auxiliary variables. Combinations of auxiliaries were 
identified in which each estimated regression coefficient was 
significant in an ordinary least squares regression at the 5% 
level. A key step that substantially improved the fit of the 
models was simply including an intercept. Factored into the 
selection of auxiliaries was also the knowledge that the survey 
has more under-coverage of Blacks than non-Blacks and that 
this needed to be accounted for by post-stratification. We 
viewed this method of variable selection as exploratory and, 
consequently, a number of combinations were studied to 
determine which set produced the best estimators of mean 
expenditures. The 56 post-strata based on age/race/sex 
currently in use in the CE were included. (The 56 are routinely 
collapsed in actual CE operations because of small sample 
sizes in some cells.) Other variables that were statistically 
significant in various combinations were region (NE, MW, 
S, W), urbanicity (urban/rural) by region, age of reference 
person of the CU (< 25, 25-34, 35-44, 45-64, 65+), household 
tenure (owner/renter), income before taxes of the CU, and the 
56 post-strata collapsed by sex and some of the age categories 
to form 10 age/race categories. Based on this information, 
weights (2.8) were computed using g given in (2.9), denoted regwts,
and in (2.10), denoted calwts. For both the regression and restricted
regression weights, we set $a_i$ equal to the adjusted base
weight, i.e., $1/\pi_i$ times a non-response adjustment. In order
for the matrix A in Section 2 to be nonsingular, one of the
categories in some auxiliaries, like region, was omitted from
each $x_i$. For this application, the population totals necessary
to evaluate $X = (X_1, \ldots, X_K)'$ were obtained mostly from the
Statistical Abstract of the United States (1993) whose sources 
are the 1990 Census figures and the Current Population 
Reports published by the U.S. Bureau of the Census. When an 
intercept is used, the appropriate control total for that variable 
is the number of CU’s in the population for which we used the 
PP estimate as a surrogate. The combinations of auxiliaries 
used to form the different weights are given in Table 1. 
Regwts0, with 56 age/race/sex post-strata, uses the largest
number of post-strata. The 56 are the starting point for the PP 
method but are usually collapsed to 30-40 because of small 
cell sizes. When computing calwts0, those 56 post-strata were 
collapsed to 45 since the constraints imposed by the L and U 
bounds could cause singularity in the matrix based algorithm. 
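
The way known control totals enter the weights can be illustrated with a minimal sketch. Formulas (2.8) to (2.10) are in Section 2 and are not reproduced here, so the code below assumes the standard linear-calibration form $w_i = a_i\{1 + (X - \hat{X})' A^{-1} x_i\}$ with $A = \sum a_i x_i x_i'$, restricted to an intercept plus one auxiliary. The function names and the toy numbers are hypothetical, and the bounds check only reports violations of $L \le w_i/a_i \le U$; it does not implement the restricted regression itself.

```python
def solve_2x2(m, rhs):
    """Solve the 2x2 linear system m z = rhs by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        # Mirrors the remark in the text: a redundant category makes A singular.
        raise ValueError("singular matrix: drop a redundant auxiliary category")
    return ((rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det)


def regression_weights(base, aux, totals):
    """Assumed linear-calibration weights w_i = a_i * (1 + lam' x_i).

    base   : adjusted base weights a_i
    aux    : auxiliary vectors x_i = (1, x_i1), i.e. intercept plus one variable
    totals : known population totals (number of units, total of the auxiliary)
    """
    # A = sum_i a_i x_i x_i'
    A = [[sum(a * x[r] * x[c] for a, x in zip(base, aux)) for c in range(2)]
         for r in range(2)]
    # Base-weighted estimates of the control totals
    est = [sum(a * x[r] for a, x in zip(base, aux)) for r in range(2)]
    lam = solve_2x2(A, [totals[r] - est[r] for r in range(2)])
    return [a * (1 + lam[0] * x[0] + lam[1] * x[1]) for a, x in zip(base, aux)]


def outside_bounds(weights, base, lower=0.5, upper=4.0):
    """Indices of units whose adjustment w_i / a_i falls outside [lower, upper]."""
    return [i for i, (w, a) in enumerate(zip(weights, base))
            if not lower <= w / a <= upper]


# Hypothetical toy data: four units, control totals N = 44 and X = 120.
base = [10.0, 10.0, 10.0, 10.0]
aux = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
weights = regression_weights(base, aux, (44.0, 120.0))
```

By construction the weighted sums of both auxiliaries reproduce the control totals exactly, which is the property the paper relies on when it treats post-stratification as a special case.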


Table 1
Weights and Their Corresponding Auxiliary Variables

Weights   Auxiliary Variables                                   K
regwts0   Age/race/sex                                          56
regwts1   Intercept, age/race/sex, region, urban x region       18
regwts2   Intercept, age/race/sex, region, urban x region,      24
          age of reference person, housing tenure,
          family income before taxes
calwts0   Age/race/sex                                          45
calwts1   Intercept, age/race/sex, region, urban x region       18
calwts2   Intercept, age/race/sex, region, urban x region,      24
          age of reference person, housing tenure,
          family income before taxes
calwts3   Intercept, age/race/sex, region, urban x region,      19
          family income before taxes (truncated at $500,000)
calwts4   Intercept, age/race/sex, region, urban x region,      23
          age of reference person, housing tenure
PP        Age/race/sex                                          56¹

¹ The initial set of 56 is usually collapsed to 30-40 because of small sample
sizes in some cells.


3.1 Comparisons of Weights 


A variety of comparisons of weights produced by the
different methods were made, only a few of which can be
mentioned here. Figure 1 shows plots of the PP weights,
regwts0, calwts0, and calwts1 versus the adjusted base
weights. For PP and regwts0, the adjustments to go from $a_i$ to $w_i$
are much more variable than for calwts0 and calwts1, which
employ the L = 0.5 and U = 4 restrictions. High variability
among the $w_i$ can lead to expenditure estimates with high
variance and to poor confidence interval coverage since large
sample normality may not hold. Even though (2.11) implies
that $a_i/2 < w_i < 4a_i$ for each i for the calwts, the lower right
panel in Figure 1 shows that the calwts1 satisfy
$a_i/2 < w_i < 2a_i$ for each i. Thus, setting U = 2 or 3 would
have little effect on calwts1. Calwts0 would have been slightly
affected by setting U = 2 since a few points were outside the
upper reference line. The upper two panels indicate that the
PP weights and regwts0 do not conform to the restriction
$a_i/2 < w_i < 2a_i$.

Figure 1. Four sets of weights plotted versus adjusted base weights

The concern about negative regression weights was minor
in the application. In the full sample, only one CU had a
negative weight for regwts1 and regwts2 while regwts0 had no
negative weights. However, in the replicates used for variance
estimation, described in Section 3.2, 2 or 3 CU’s did have
negative weights in many replicates so that using the L
restriction was more important there.


3.2 Precision of Estimates from the Different Methods 


Although comparison of weights is instructive, the methods
must ultimately be judged based on the level of estimated CU
means and their precision. The standard errors of these
estimators were computed via the method of balanced half
sampling (BHS) using 44 replicates as currently implemented
in the CE for the PP estimator. The BHS estimator is
constructed to reflect the stratification and the clustering that
is used in the CE. A half sample is constructed in a prescribed
way (McCarthy 1969) to contain one half of the first-stage
sample units in a survey. Defining the mean per CU based on
CU’s in half-sample $\alpha$ to be $\bar{y}_{R(\alpha)}$ and that for the full sample
to be $\bar{y}_R$, the BHS estimate of variance is

    $v_{BHS}(\bar{y}_R) = \sum_{\alpha=1}^{44} (\bar{y}_{R(\alpha)} - \bar{y}_R)^2 / 44$.

To compute each $\bar{y}_{R(\alpha)}$, the same
estimation steps used for the full sample are repeated for the
CU’s in the half-sample. As the expenditure estimates from
the CE Survey are published for various domains of
interest, we computed the means and the standard errors for a
few chosen domains as well. For each of these, the coefficient
of variation (cv) was computed and then its ratio to the cv of
the PP weight estimate was calculated.
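
The BHS computation and the cv ratio used in the comparisons can be sketched as follows. This is only an illustration: the function names are hypothetical, the replicate means are made-up numbers (the survey uses 44 replicates, the example uses 4), and the construction of the balanced half-samples themselves is not shown.

```python
def bhs_variance(full_mean, half_sample_means):
    """Average squared deviation of the half-sample means from the full-sample mean."""
    replicates = len(half_sample_means)
    return sum((m - full_mean) ** 2 for m in half_sample_means) / replicates


def coefficient_of_variation(mean, variance):
    """cv = standard error divided by the estimate."""
    return variance ** 0.5 / mean


def cv_ratio(mean_alt, var_alt, mean_pp, var_pp):
    """Ratio of an alternative estimator's cv to the PP cv; values below 1 favour the alternative."""
    return coefficient_of_variation(mean_alt, var_alt) / coefficient_of_variation(mean_pp, var_pp)


# Hypothetical numbers for one expenditure: full-sample mean 100 and four replicate means.
variance = bhs_variance(100.0, [98.0, 102.0, 101.0, 99.0])
```

Each replicate mean would in practice be obtained by rerunning the whole weighting procedure on the half-sample, which is why negative weights inside replicates matter for the regression weights.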

For each type of weight, if the ratio of each expenditure cv 
to that of the PP weights is less than one, an improvement over 
the PP estimate is indicated since, for all the weights, the 
expenditure mean estimates were very close to those of the PP 
estimates. We computed the ratios of cv’s and the ratios of 
means for each of the sets of weights described in Table 1, for 
each of the chosen expenditures, and for each of the following 
domains: 


(1) Age of Reference Person: < 25, 25-34, 35-44, 45-54, 
55-64, 65+ 

(2) Region: NE, MW, S, W 

(3) Size of CU: 1, 2, 3, 4, 5+ 

(4) Composition of Household: Husband and wife only, 
Husband and wife + children, Other Husband and wife, 
One parent + at least one child < 18, Single person and 
other CU’s 

(5) Household Tenure: Owner, Renter 

(6) Race of Reference Person: Black, Non-Black. 


We will discuss only domains (1) to (3) here. In addition,
ratios for all CU’s, i.e., the total across the domains, were
computed for each expenditure and are shown in Table 2. For
All Expenditures, regwts2, calwts2, and calwts3, with ratios
of .79, .78, and .75, provide substantial reduction in cv
compared to PP. For less aggregated expenditures regwts1 or
calwts1 provide reasonably consistent improvements over PP
without the losses incurred by some of the other weights for
expenditures like Furniture, Personal insurance and pensions,
and its sub-category Pensions and social security.


Table 2
Ratios to PP cv of cv’s for the Different Weighting Methods
The Minimum Ratio is Highlighted in Each Row

                                       regwts               calwts
Expenditure                        0     1     2     0     1     2     3     4
All expenditures                 0.98  0.90  0.79  0.98  0.90  0.78  0.75  0.87
Shelter                          0.93  0.85  0.75  0.93  0.85  0.74  0.72  0.84
Utilities                        1.08  1.03  0.94  1.07  1.03  0.88  [?]   0.92
Furniture                        1.08  [?]   [?]   [?]   [?]   [?]   [?]   [?]
Major appliances                 1.08  1.06  1.04  1.06  1.08  1.09  1.00  1.03
All vehicles                     0.90  0.89  0.98  0.91  0.89  0.98  0.97  0.90
New cars, trucks                 0.95  0.91  1.01  0.96  0.91  1.02  1.02  0.91
Used cars, trucks                0.98  0.94  0.96  0.97  0.94  0.97  0.96  0.95
Gasoline, motor oil              [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]
Health care                      1.05  0.97  0.86  1.07  0.97  0.85  0.87  0.94
Education                        0.92  0.93  1.04  0.91  0.93  1.06  1.07  0.88
Cash contributions               [?]   1.02  1.28  [?]   1.02  1.30  1.29  1.03
Personal insurance, pensions     1.00  0.97  1.64  1.01  0.98  1.24  0.98  0.95
Life, other personal insurance   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]
Pensions, social security        1.00  0.99  1.75  1.01  0.99  1.34  1.06  0.97

[?] Entry illegible in the source.

Trellis plots (Cleveland 1993) of the cv and mean ratios for 
calwts0 and calwts1 are given in Figures 2-4. Calwts0 is
pictured because it is the nearest calibration equivalent to the 
current method of post-stratification. Calwts1 appears to be 
the best of the alternatives we have examined in the sense of 
improving the All Expenditures estimates while providing 
consistent performance for individual expenditure groups. In 
each panel of the plots a vertical reference line is drawn at 1, 
the point of equality between the calibration results and those 
for the PP method. The lower row in each plot presents ratios 
of means from calwts0 and calwts1 to the PP means and
illustrates that with a few exceptions the levels of the means 
from the two restricted regression choices are about the same 
as from PP. 

The two calibration choices, in the main, improve cv’s 
compared to PP, i.e., cv ratios tend to be less than 1, for most 
domains and expenditures, and calwts1 is somewhat better 
than calwts0. For the age-of-reference-person domains < 25
and 65+, for example, 12 of the 15 expenditures have calwts1 
ratios of less than 1. For CU sizes 1-4 the numbers of cv ratios 
less than or equal to 1 are 12, 9, 9, and 11. There are 
exceptions, of course. For the South region only 6 of 
15 expenditures have calwts1 cv ratios less than or equal to 1. 

Calwts2 and calwts3, which used family income before
taxes as one of the auxiliaries, had somewhat erratic performance
for domains, sometimes making major improvements
over PP but occasionally showing serious losses. This is
connected to the nature of the family income variable itself.
For the entire data set of 5156 CU’s, income before taxes was
positive for 4698 CU’s, zero for 450 CU’s and negative for
8 CU’s. The zeroes are incomplete income reporters while the
negatives are for families that had business losses added to
other income. In either case, these CU’s vitiate the usefulness
of this variable in predicting expenditures. Perhaps use of
another measure of income combined with item imputations
for missing incomes would improve calwts2 and calwts3 for
domain estimation.

Figure 2. Ratios to PP of cv’s and means for two weighting methods by age of reference person

Figure 3. Ratios to PP of cv’s and means for two weighting methods by region

Figure 4. Ratios to PP of cv’s and means for two weighting methods by size of CU

Taking all of the above into consideration, regwts1, 
calwts1 and calwts4 are efficient choices in this application. 
Calwts1 has the advantage of non-negative weights over
regwts1. Since calwts4 requires 23 auxiliary variables as
opposed to calwts1’s 18, calwts1 is the more parsimonious 
choice. Subsequent to the analysis discussed here, we 
performed a similar study using a full year’s data for both the 
Interview and Diary Surveys for 1990. Results were similar to 
those reported here and a final set of 24 auxiliaries was 
adopted based on number of persons by age, race, sex, region, 
urban x region, and number of CU’s by tenure, and an 
intercept. The conversion of CE estimation to restricted 
regression is now underway.


4. CONCLUSION 


The objective of this study was to investigate methods for 
deriving household weights that did not depend on the weight 
of one single member of the household. Different types of 
weights based on the regression estimation procedure were 
presented and their relative merits evaluated. Regression 
estimation incorporates the current survey post-stratification 
methods in which the weighted sum of the persons in each 
post-stratum is forced to be equal to an independent census 
count of that number. This is accomplished via auxiliary 
variables that are incorporated into the regression model. It 
also automatically produces for each sample household a 
weight that does not depend on any single one of its members. 

We studied eight types of weights that came from five 
different regression models. In order to eliminate the 
undesirable negative weights that can result from ordinary 
least-squares regression estimation, restricted regression 
estimators were adapted to the present problem. Restricted 
regression has the flexibility to restrict the possible deviation 
of each final weight from its base weight while adhering to the 
properties discussed above. This, in particular, allows the 
constraint of positive weights. The restricted regression 
weights are easily computed via matrix-oriented software like
S-Plus™ or SAS/IML™.

Restricted regression, and more generally, restricted 
calibration have a number of attractive features for household 
surveys, like the one studied here, but also for surveys of other 
types of units like hospitals, schools, or business establishments
where a variety of auxiliary data may be available.
Given past data on target variables, standard model building 
procedures can be used for the selection of auxiliary variables. 
The properties of regression estimation can be used to choose 
the predictors optimally in order to reduce the redundancy of 
information that gets incorporated into the survey estimation 
procedure. This is one of the greatest advantages of using an 
estimator that has a vast and tested literature behind it. Good
predictors may include qualitative variables, e.g., age, race,
type of hospital (general medical, psychiatric, etc.), type of 
business (manufacturing, retail trade, etc.) that might be often 
used in stratification or post-stratification. The predictors can 
also be quantitative variables like family income, annual sales, 
number of students at different levels, or the number of 
inpatient days to name but a few. In our application, including 
an intercept also led to noticeably smaller standard errors of 
survey estimates. The regression approach also allows data at 
different levels to be easily incorporated in estimation. In the 
household survey studied here, auxiliaries on both persons and 
households were included. 

The immense flexibility of regression gives practitioners 
options they might not otherwise have. If new, pertinent 
predictor variables become available, software for regression 
estimation can accommodate them simply by changing the 
matrix of auxiliaries and vector of population controls. 
Software that is rigidly written to perform only post-stratification
or ratio estimation with a single auxiliary, for
example, might have to undergo a major overhaul to change 
the estimator. Of course, if the estimator is one of the less 
general post-stratification or the ratio types, regression 
software will often handle it as a special case. In the United 
States, an extremely large continuing household survey is 
being contemplated (Love, Alexander and Dalzell 1995) that 
will provide very precise estimates of many characteristics 
that may be used as control totals in smaller surveys. The 
restricted regression approach positions the CE Survey to 
smoothly incorporate such new data in estimation should it 
become available.


A Transformation Method for Finite Population Sampling
Calibrated With Empirical Likelihood


GEMAI CHEN and JIAHUA CHEN


ABSTRACT 


In this paper, we study a confidence interval estimation method for a finite population average when some auxiliary 
information is available. As demonstrated by Royall and Cumberland in a series of empirical studies, naive use of existing 
methods to construct confidence intervals for population averages may result in very poor conditional coverage 
probabilities, conditional on the sample mean of the covariate. When this happens, we propose to transform the data to 
improve the precision of the normal approximation. The transformed data are then used to make inference on the original 
population average, and the auxiliary information is incorporated into the inference directly, or by calibration with empirical 
likelihood. Our approach is design-based. We apply our approach to six real populations and find that when transformation 
is needed, our approach performs well compared to the usual regression method. 


KEY WORDS: Finite population; Sampling; Confidence interval; Transformation; Empirical likelihood. 


1. INTRODUCTION 


Let $(x_i, y_i)$, $i = 1, 2, \ldots, N$, be values associated with N units
in a finite population. For unit i, $y_i$ is the variable of interest
and $x_i$ is an auxiliary variable. One of the most extensively
studied finite population problems is the estimation of the
population average $\bar{y} = (y_1 + \cdots + y_N)/N$ (or total $N\bar{y}$) under
various sampling schemes. We shall focus on the simple
random sampling scheme in this paper, because the nature of
the problems we want to study can be better seen from this
scheme and the results obtained here can be easily generalized
to other sampling schemes of which the simple random
sampling scheme is the building block.

It is often true that some information about the auxiliary
variable x is known and can be used to make inference about
y. For example, let $S = \{1, \ldots, i, \ldots, N\}$ and let $s \subset S$ be a
simple random sample of size n. When $\bar{x} = (x_1 + \cdots + x_N)/N$ is
known, and x and y are correlated, the population average $\bar{y}$
can be estimated by the ratio estimator $\hat{y} = (\bar{y}_s/\bar{x}_s)\bar{x}$, or by
the regression estimator $\hat{y} = \bar{y}_s + b(\bar{x} - \bar{x}_s)$, where $\bar{x}_s$ and $\bar{y}_s$
are the sample averages of x and y, respectively, and

    $b = \sum_{i \in s}(x_i - \bar{x}_s)(y_i - \bar{y}_s) / \sum_{i \in s}(x_i - \bar{x}_s)^2$.

Under very general conditions, both the ratio estimator and
the regression estimator are asymptotically normal; see Scott
and Wu (1981), Bickel and Freedman (1984), and Theorem 2.1
of Section 2. Hence, if v is a carefully chosen estimator of the
variance of $\hat{y}$, the standardized variable $(\hat{y} - \bar{y})/\sqrt{v}$ is
customarily treated as having the standard normal distribution.
Therefore, if $z_\alpha$ denotes the upper $\alpha$-percentile of the standard
normal distribution, then

    $(\hat{y} - z_\alpha \sqrt{v},\ \hat{y} + z_\alpha \sqrt{v})$    (1.1)

will produce an approximate $100(1 - 2\alpha)\%$ confidence
interval for $\bar{y}$.
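
As an illustration of interval (1.1), the sketch below computes the regression estimator and a normal-theory interval from a simple random sample. The text leaves the variance estimator v unspecified, so the code assumes the usual textbook residual-based estimator with a finite population correction; that choice, the function name and the toy data are assumptions made for the example only.

```python
from statistics import NormalDist, mean


def regression_interval(x, y, x_bar_pop, pop_size, alpha=0.025):
    """Regression estimate of the population mean and the interval (1.1).

    x, y      : sample values of the auxiliary and study variables
    x_bar_pop : known population mean of x
    pop_size  : population size N
    alpha     : upper tail area, so the nominal level is 100(1 - 2*alpha)%
    """
    n = len(x)
    xs, ys = mean(x), mean(y)
    sxx = sum((xi - xs) ** 2 for xi in x)
    b = sum((xi - xs) * (yi - ys) for xi, yi in zip(x, y)) / sxx
    estimate = ys + b * (x_bar_pop - xs)
    # Assumed variance estimator: (1 - n/N)/n times the residual mean square.
    residual_ss = sum(((yi - ys) - b * (xi - xs)) ** 2 for xi, yi in zip(x, y))
    v = (1 - n / pop_size) / n * residual_ss / (n - 2)
    z = NormalDist().inv_cdf(1 - alpha)  # upper alpha point of N(0, 1)
    half_width = z * v ** 0.5
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical sample of size 5 from a population of 50 units with known x mean 3.5.
estimate, (lower, upper) = regression_interval(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1], 3.5, 50
)
```

The interval is symmetric around the point estimate by construction, which is exactly what goes wrong when the error distribution is strongly skewed.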

Confidence interval (1.1) is widely used in practice.
However, problems arise when it is applied to certain
populations. Royall and Cumberland (1981a, 1981b, 1985)
studied the ratio and regression estimators and applied them
to six real populations where strong correlations between x
and y seemed to exist. (See Section 3 for a summary of the six
populations.) Various estimators of the variance of $\hat{y}$ were
used. It was found that the actual conditional coverage rate of
the confidence interval (1.1), conditional on $\bar{x}_s$, depended
heavily on the size of $\bar{x}_s$ and was usually much lower than
the claimed coverage rate, even with the most adaptive
variance estimator. For example, the 95% confidence interval
for a population named Counties 70 had a conditional
coverage rate of 76% with the jackknife variance estimator when
$\bar{x}_s$ was small, and the conditional coverage rate could go as
low as 50% with other variance estimators.

The above mentioned studies point to the need to construct 
confidence intervals that “will live up to their name” (Royall
and Cumberland 1985, p. 359). However, up to now there has 
been little progress made in this direction. In this paper, we 
present some results from studying an alternative procedure 
for constructing confidence intervals and from applying it to 
the six populations studied by Royall and Cumberland and 
many others. As will be shown in Section 3, the conditional 
coverage rate of our confidence intervals is more accurate. 

Two important ideas, namely, transformation and empirical
likelihood, are used simultaneously to attack the problems
encountered by Royall and Cumberland in particular, and to
develop a new procedure in general. As explained in Cochran
(1977, p. 150), the preference in sample survey theory is to
make, at most, limited assumptions about the frequency
distribution followed by the data in the sample. However,
a ratio or regression estimator can help obtain increased precision
by taking advantage of the correlation between $y_i$ and
$x_i$. This, of course, can be described by some assumption(s),
such as an approximate linear relationship between y and x.
Although almost no further assumptions are necessary to use
the ratio or regression approach, the procedure (1.1) is clearly
based on a normal approximation. But, as is well known,
the normal approximation can be very poor when the
population distribution is severely skewed and the sample size
is small. In terms of procedure (1.1), the closer the estimator
distribution is to the normal, the better one can construct
confidence intervals. If the population distribution is severely
skewed, a transformation may produce a population distribution
that is at least more symmetric, so that the normal
approximation for the estimator is more accurate.

When using the ratio and regression estimators, knowing $\bar{x}$
is crucial to gain improvement over the use of the sample mean.
In our proposed procedure, the complete information about
the auxiliary variable x can be incorporated. But if $\bar{x}$ is the
only auxiliary information available, it is difficult to use this
information directly when a transformation is involved,
because any non-linear transformation obscures the link
between $\bar{x}$ and $\bar{y}$. In this second case, we find the method of
empirical likelihood very helpful in solving our problem; see
particularly Owen (1988, 1990) and Chen and Qin (1992) for
references. The empirical likelihood method in this situation
can also be regarded as a calibration method as discussed in
Deville and Särndal (1992). This approach rescues us from
losing information about $\bar{x}$ after transforming the data.

There have been many discussions on how to use transformations
to make better inference on the transformed scale
(Box and Cox 1964; Carroll and Ruppert 1988; Calvin and
Sedransk 1991, and the references therein). There have also
been some studies on how to make inference on the original
scale, after a transformation is applied (Carroll and Ruppert
1984; Elliott 1977). What is new with our procedure is the
attempt to link the above two steps by combining transformation
with auxiliary information and/or by applying the
empirical likelihood method when necessary.

The details of our procedure are given in Section 2. Then 
our procedure is applied to the six populations studied by 
Royall and Cumberland in Section 3. The validity of our 
procedure in an arbitrary setting is demonstrated in Section 4 
and some comments are made at the end of the paper. 


2. THE NEW PROCEDURE 


As mentioned in the last section, a problem with the
confidence interval (1.1) is that it will fail if the distribution
of $(\hat{y} - \bar{y})/\sqrt{v}$ is severely asymmetric and far from the normal
distribution. The problem can be inherited from the skewness
of the population distribution. When the skewness is severe,
a central confidence interval procedure like (1.1) is doomed
to fail. The basic model employed by Royall and Cumberland
(1981a, 1981b, 1985) is

    $y_i = \alpha + \beta x_i + \epsilon_i$    (2.1)

with $E(\epsilon_i) = 0$, $V(\epsilon_i) = \sigma^2$ and $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \ne j$. It is
easy to find that for the six real populations studied by Royall
and Cumberland, the corresponding error distributions are
very skewed. These observations lead us to consider
transforming the variables y and/or x, and consider the model

    $h(y_i) = \alpha + \beta g(x_i) + \sigma \epsilon_i$,    (2.2)

where $h(\cdot)$ and $g(\cdot)$ are two monotone functions. There are
many families of transformations suggested in the literature.
One commonly used family is the Box-Cox power transformation
family defined by

    $f(x, \lambda) = (x^{\lambda} - 1)/\lambda$ when $\lambda \ne 0$,
    $f(x, \lambda) = \log(x)$ when $\lambda = 0$.

Model (2.1) is a special case of (2.2) when both h and g equal
$f(x, 1)$.
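
A small sketch of the Box-Cox family and its inverse may help fix ideas. The function names are hypothetical; the inverse is included because the procedure later needs to map fitted values back to the original scale.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transformation of a positive value x."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def inverse_box_cox(z, lam):
    """Inverse of box_cox for the same lam (requires lam * z + 1 > 0 when lam != 0)."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


# lam = 1 is a shift of the identity, so model (2.2) reduces to a linear model in x and y;
# lam = 0 is the log transformation suggested for right-skewed populations.
shifted = box_cox(5.0, 1)
logged = box_cox(math.e, 0)
```

Because the case lam = 1 differs from the identity only by a constant, it changes the intercept of the fitted line but not the form of model (2.1).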

The choice of transformations in model (2.2) might be 
suggested by an examination of the sample x's and y's based 
on a possible model relationship, or by our subject knowledge 
about the population under investigation. For example, for the 
six populations discussed in Royall and Cumberland, the 
population distributions are severely skewed towards the right,
which can be learned from the nature of the finite populations.
Therefore, a log transformation may make them all
less skewed. Other more objective methods of choosing 
transformations are discussed in Section 4. 

We emphasize that models (2.1) and (2.2) are used here to 
motivate transformations, point estimators, or confidence 
interval procedures. Our study of conditional coverage rates 
will, however, be based on the probability measure generated 
by the design, as in Royall and Cumberland (1985). For this 
purpose, we embed our finite population in a sequence of 
populations indexed by k. This means that a sub-index k is 
needed to write $N = N_k$ and $n = n_k$, etc., but for simplicity,
we will suppress the index k if there is no possibility for 
confusion. 

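Let $v_i = h(y_i)$, $u_i = g(x_i)$, $\bar{v}_N = N^{-1}\sum_{i=1}^{N} v_i$ and
$\bar{u}_N = N^{-1}\sum_{i=1}^{N} u_i$. Define

    $\beta_N = \sum_{i=1}^{N} (u_i - \bar{u}_N) v_i / \sum_{i=1}^{N} (u_i - \bar{u}_N)^2$,

    $\alpha_N = \bar{v}_N - \beta_N \bar{u}_N$,

    $e_i = v_i - (\alpha_N + \beta_N u_i)$.

Suppose $s \subset S$ is a simple random sample of size n. We
similarly define

    $\hat\beta = \sum_{i \in s} (u_i - \bar{u}_s) v_i / \sum_{i \in s} (u_i - \bar{u}_s)^2$,

    $\hat\alpha = \bar{v}_s - \hat\beta \bar{u}_s$,

    $\hat\sigma^2 = (n - 2)^{-1} \sum_{i \in s} (v_i - \hat\alpha - \hat\beta u_i)^2$,

where $\bar{u}_s$ and $\bar{v}_s$ are the sample averages.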


Denote the inverse function of $h(\cdot)$ by $h^{-1}(\cdot)$. Then the
fitted value of $y_i$ is

    $\hat{y}_i = h^{-1}(\hat\alpha + \hat\beta u_i)$.    (2.3)

We discuss confidence interval estimation of $\bar{y}$ in two cases.
In the first case where all $x_i$ ($i = 1, \ldots, N$) are known, a natural
estimator of $\bar{y}$ is $(\sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i)/N$. However, for the
purpose of constructing confidence intervals for $\bar{y}$, we study
the distribution of

    $\hat{y}(\hat\alpha, \hat\beta) = N^{-1} \sum_{i=1}^{N} \hat{y}_i = \int_{-\infty}^{\infty} h^{-1}(\hat\alpha + \hat\beta u)\, dF_N(u)$    (2.4)

instead, where $F_N(u)$ is the empirical distribution function of
the $u_i$ ($i = 1, \ldots, N$). Clearly, the distribution of $\hat{y}(\hat\alpha, \hat\beta)$ is
determined by the distribution of $(\hat\alpha, \hat\beta)$, which is described in
the following design-based theorem.
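
The steps leading to (2.3) and (2.4) can be sketched for the case where every $x_i$ is known. The sketch assumes the log transformation for both h and g, a least-squares fit on the transformed scale, and a residual variance with divisor n - 2; the function names and the toy population are hypothetical.

```python
import math
from statistics import mean


def fit_transformed(x_sample, y_sample, h=math.log, g=math.log):
    """Least-squares fit of h(y) on g(x) from the sample; returns (alpha, beta, sigma2)."""
    u = [g(xi) for xi in x_sample]
    v = [h(yi) for yi in y_sample]
    u_bar, v_bar = mean(u), mean(v)
    beta = sum((ui - u_bar) * vi for ui, vi in zip(u, v)) / sum((ui - u_bar) ** 2 for ui in u)
    alpha = v_bar - beta * u_bar
    # Assumption: residual variance uses the divisor n - 2.
    sigma2 = sum((vi - alpha - beta * ui) ** 2 for ui, vi in zip(u, v)) / (len(u) - 2)
    return alpha, beta, sigma2


def plug_in_mean(alpha, beta, x_population, g=math.log, h_inv=math.exp):
    """Average of the back-transformed fitted values over all N units, as in (2.4)."""
    return sum(h_inv(alpha + beta * g(xi)) for xi in x_population) / len(x_population)


# Hypothetical population with y = 3x exactly, so the log-log fit is perfect.
x_population = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
alpha_hat, beta_hat, sigma2_hat = fit_transformed([1.0, 4.0, 16.0], [3.0, 12.0, 48.0])
y_hat = plug_in_mean(alpha_hat, beta_hat, x_population)
```

With noisy data the back-transformed average is generally biased for the population mean, which is the issue the later bias correction addresses.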


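Theorem 2.1 Suppose that when $k \to \infty$, both $n = n_k$ and
$N - n = N_k - n_k$ go to $\infty$ and

1. $\bar{u} = \lim_{k \to \infty} N^{-1} \sum_{i=1}^{N} u_i$ exists.

2. [condition illegible in source] $= O(1)$.

3. $\sigma_u^2 = \lim_{k \to \infty} \sigma_{uN}^2 = \lim_{k \to \infty} (N-1)^{-1} \sum_{i=1}^{N} (u_i - \bar{u}_N)^2$ exists
   and is greater than zero.

4. $\sigma_e^2 = \lim_{k \to \infty} \sigma_{eN}^2 = \lim_{k \to \infty} (N-1)^{-1} \sum_{i=1}^{N} e_i^2$ exists and is
   greater than zero.

5. Moment bounds of the form $N^{-1} \sum_{i=1}^{N} |e_i|^{[?]} = O(1)$ and
   $N^{-1} \sum_{i=1}^{N} |(u_i - \bar{u}_N) e_i|^{[?]} = O(1)$ hold (exponents illegible in source).

6. $r = \lim_{k \to \infty} (\sigma_{uN} \sigma_{eN})^{-2} N^{-1} \sum_{i=1}^{N} (u_i - \bar{u}_N)^2 e_i^2$ exists and is
   greater than zero.

7. $f = \lim_{k \to \infty} n/N$ exists and is less than 1.

Then

(1) $\sqrt{n}(\hat\alpha - \alpha_N, \hat\beta - \beta_N)'$ converges in distribution to the
bivariate normal distribution $N_2(0, \Sigma)$, where

    $\Sigma = (1 - f)\, \sigma_e^2 \begin{pmatrix} 1 + \dfrac{\bar{u}^2}{\sigma_u^2} r & -\dfrac{\bar{u}}{\sigma_u^2} r \\ -\dfrac{\bar{u}}{\sigma_u^2} r & \dfrac{1}{\sigma_u^2} r \end{pmatrix}$.

(2) Let $B_\gamma$ be any joint $100(1 - \gamma)\%$ confidence region for
$(\alpha_N, \beta_N)$ and define $G_\gamma$ by

    $G_\gamma = \{\hat{y}(\alpha, \beta) : (\alpha, \beta) \in B_\gamma\}$,    (2.5)

then,

    $\mathrm{Prob}\{\hat{y}(\alpha_N, \beta_N) \in G_\gamma\} \ge 1 - \gamma$,

where $\hat{y}(\alpha_N, \beta_N) = \sum_{i=1}^{N} h^{-1}(\alpha_N + \beta_N u_i)/N$.
The proof is deferred to the Appendix.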


We note that without underlying normality on the errors,
it is not easy to get an exact confidence region $B_\gamma$ for $(\alpha_N, \beta_N)$
for a specified confidence level $1 - \gamma$. The $B_\gamma$ used in the
following discussion and the expressions built upon it are,
therefore, approximate.

Theorem 2.1 allows us to construct confidence intervals
for $\hat{y}(\alpha_N, \beta_N)$, but $\hat{y}(\alpha_N, \beta_N)$ is not equal to $\bar{y}$ in general. This
is an intrinsic problem as long as a non-linear transformation
is used. If only a point estimator is needed, we would use the
regression estimator currently, and we suggest that the
methodology developed in this paper be used for interval
estimation. Bias corrections for $\hat{y}(\hat\alpha, \hat\beta)$ are, however,
possible, and a specific one is used in our simulation study.
Work on general corrections is under study.

According to Theorem 2.1, $G_\gamma$ is a conservative confidence
interval for $\hat{y}(\alpha_N, \beta_N)$, which can also be regarded as an
approximate confidence interval for $\bar{y}$. To improve the
coverage rate of $G_\gamma$, observe that the contours of $\hat{y}(\alpha, \beta)$ in
a small neighborhood of $O = (\hat\alpha, \hat\beta)$ are approximately parallel
straight lines on the $\alpha\beta$ plane; see Figure 1. Let (a, b) be the
directional cosines of the direction EF along which the
contours increase. Then $\hat{y}(\alpha, \beta)$ is (approximately) a
monotone function of $T = a(\alpha - \hat\alpha) + b(\beta - \hat\beta)$, where T is
the corresponding change along the direction EF to the
changes in $\alpha$ and $\beta$. A natural choice of $B_\gamma$ is

    $B_\gamma = \{(\alpha, \beta) : |a(\alpha - \hat\alpha) + b(\beta - \hat\beta)| < c\,\hat\sigma\, t(\gamma/2; n - 2)\}$,

where $c^2 = \mathrm{Var}(T)/\sigma^2$, $\mathrm{Var}(T)$ is the variance of T, and
$t(\gamma/2; n - 2)$ is the upper $\gamma/2$-percentile of the t distribution
with n - 2 degrees of freedom. This $B_\gamma$ is the region between
two parallel straight lines AB and CD in Figure 1.

Figure 1. Contour plot of the bivariate function $\hat{y}(\alpha, \beta)$ in the
neighbourhood of $O = (\hat\alpha, \hat\beta)$, based on a random
sample of size 32 taken from population Cancer
(horizontal axis: Alpha; vertical axis: Beta)
A drawback of the above $B_\gamma$ is that it is an unbounded
region. If the contours of $\hat{y}(\alpha, \beta)$ are not close to being parallel
and/or straight, this $B_\gamma$ will lead to very conservative
confidence intervals. To guard against this possibility, we
construct a bounded elliptic region $C_\gamma$ defined by those $(\alpha, \beta)$
that satisfy

    $n(\alpha - \hat\alpha)^2 + 2n\bar{u}_N(\alpha - \hat\alpha)(\beta - \hat\beta) + [\text{term illegible in source}] \le (1 - n/N)\, \hat\sigma^2 t^2(\gamma/2; n - 2)$,

where (1 - n/N) is part of the variances of $\hat\alpha$ and $\hat\beta$, because
we are doing sampling without replacement from a finite
population, and

    $\hat{r} = \dfrac{n^{-1} \sum_{i \in s} (u_i - \bar{u}_N)^2 (v_i - \hat\alpha - \hat\beta u_i)^2}{\{n^{-1} \sum_{i \in s} (u_i - \bar{u}_N)^2\}\{(n - 2)^{-1} \sum_{i \in s} (v_i - \hat\alpha - \hat\beta u_i)^2\}}$    (2.6)

is a sample estimate of the quantity r in Theorem 2.1. The $C_\gamma$
thus defined is represented by the region inside the ellipse in
Figure 1 and has the property that it touches both boundary
lines of $B_\gamma$ regardless of the direction (a, b). Therefore, when
$\hat{y}(\alpha, \beta)$ is indeed a monotone function of T, $C_\gamma$ produces the
same confidence interval for $\bar{y}$ as $B_\gamma$ does. However, $C_\gamma$ is less
vulnerable than $B_\gamma$ if the contours of $\hat{y}(\alpha, \beta)$ are not close to
being parallel and/or straight, because $C_\gamma$ shrinks to one point as
n increases. A confidence interval for $\bar{y}$ corresponding to $C_\gamma$
is defined as

    $\{\hat{y}(\alpha, \beta) : (\alpha, \beta) \in C_\gamma\}$.    (2.7)

As the error distributions are more symmetric after the
transformation, the new confidence interval based on $C_\gamma$ is
therefore expected to be better than the confidence interval
without transformation. Note that since all $x_i$ are known,
other approaches, such as optimal stratification and post-stratification,
may be better. However, optimal stratification
may not be possible in some cases as discussed in Cochran
(1977, p. 134). Also research is needed on the use of post-stratification
when the error distributions are severely skewed.

We now turn to the discussion of the second case where
$\bar{x} = (x_1 + \cdots + x_N)/N$ is known, but $x_i$, $i = 1, \ldots, N$, are
unknown. If we want to proceed as in the first case, one
approach is to estimate $F_N(u)$ and somehow make use of the
information in $\bar{x}$. The following empirical likelihood
methodology is found to be an effective way of doing this.
We outline the main ideas here; the interested reader should
consult Owen (1988, 1990) and Chen and Qin (1992) for
more details. The key idea is to maximize the (empirical)
likelihood functions under various restrictions formed by the
knowledge about some aspects of the parameters. For
example, in our problem, the knowledge is $\bar{x}$. It is shown by
Chen and Qin (1992) that the resulting estimators with the
presence of restrictions are asymptotically more efficient than
those without restrictions.

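Specifically, we estimate $F_N(u)$ in (2.4) by

$$\hat{F}_N(u) = \sum_{i \in s} p_i \, I[u_i \le u], \qquad (2.8)$$

where the $p_i$ are chosen by maximizing

$$\prod_{i \in s} p_i \qquad (2.9)$$

subject to

$$p_i \ge 0, \qquad \sum_{i \in s} p_i = 1, \qquad \sum_{i \in s} p_i x_i = \bar{X}. \qquad (2.10)$$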


If $y_i$, $i \in s$, are regarded as realizations of the random variables
$Y_i$, $i \in s$, with distribution function $F$, the $p_i$ in (2.9) can be
defined by $p_i = F(Y_i) - F(Y_i-)$, and (2.9) is called the
empirical likelihood function in Owen (1990).

Deville and Särndal (1992) look at the above approach
from a calibration point of view. They suggest using unequal
weights for different units sampled to reflect their different
contributions, while keeping $\sum_{i \in s} p_i x_i = \bar{X}$. It is believed that if
these weights give a perfect estimate of $\bar{X}$, they should also be
good for estimating the population mean of y.
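The maximization in (2.9) and (2.10) can be sketched numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the standard Lagrange-multiplier form of the solution, $p_i = 1/\{n(1 + \lambda (x_i - \bar{X}))\}$, and finds $\lambda$ by bisection. The function name and arguments are hypothetical.

```python
def empirical_likelihood_weights(x, x_pop_mean, iterations=200):
    """Weights p_i maximizing prod(p_i) subject to the constraints in (2.10).

    Assumes the usual Lagrange-multiplier form of the solution,
    p_i = 1 / (n * (1 + lam * (x_i - x_pop_mean))).
    """
    n = len(x)
    d = [xi - x_pop_mean for xi in x]  # sampled x values centred at the known mean
    if not (min(d) < 0.0 < max(d)):
        raise ValueError("no solution: known mean lies outside the sample range")

    def g(lam):
        # Constraint function; strictly decreasing in lam, zero at the solution.
        return sum(di / (1.0 + lam * di) for di in d)

    # Open interval of lam values on which every weight stays positive.
    lo, hi = -1.0 / max(d), -1.0 / min(d)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * di)) for di in d]
```

The resulting weights are positive, sum to one, and reproduce the known population mean of x. A general-purpose root finder could replace the bisection loop.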

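The solution to (2.9) and (2.10) will not exist if either the
minimum x value in a sample is greater than or equal to $\bar{X}$, or
the maximum x value in a sample is less than or equal to $\bar{X}$.
When this happens, one remedy is to replace (2.9) with

$$\sum_{i \in s} (n p_i - 1)^2, \qquad (2.11)$$

which is minimized subject to a milder constraint

$$\sum_{i \in s} p_i = 1, \qquad \sum_{i \in s} p_i x_i = \bar{X}. \qquad (2.12)$$

Under (2.11) and (2.12), we have

$$p_i = n^{-1} + (\bar{X} - \bar{x})(x_i - \bar{x}) \Big/ \sum_{i \in s} (x_i - \bar{x})^2, \qquad (2.13)$$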
which always exists unless all the $x_i$ in the sample are the
same. The latter situation corresponds to the lack of a
covariate, which implies $p_i = n^{-1}$ if $\bar{X} = x_i$, or the solution does
not exist if $\bar{X} \ne x_i$. The function given in (2.11) is called the
Euclidean likelihood, which is asymptotically equivalent to
the empirical likelihood (2.9) (Owen 1990).
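The closed form (2.13) is simple to compute directly. The following sketch is illustrative: the function names are hypothetical, and `x_bar` is assumed to be the sample mean of the sampled x values. It returns the Euclidean likelihood weights and evaluates the weighted distribution function estimate in (2.8).

```python
def euclidean_likelihood_weights(x, x_pop_mean):
    """Weights from the closed form (2.13); x holds the sampled x values."""
    n = len(x)
    x_bar = sum(x) / n  # sample mean (assumed meaning of the barred x in (2.13))
    ss = sum((xi - x_bar) ** 2 for xi in x)
    if ss == 0.0:
        raise ValueError("all sampled x values are equal; (2.13) is undefined")
    slope = (x_pop_mean - x_bar) / ss
    return [1.0 / n + slope * (xi - x_bar) for xi in x]


def weighted_cdf(u_sample, weights, u):
    """Estimate of F_N(u) as in (2.8): total weight of sampled u_i <= u."""
    return sum(p for ui, p in zip(u_sample, weights) if ui <= u)
```

Because the constraint $p_i \ge 0$ is dropped in (2.12), some weights can be negative when the known mean is far from the sample mean, so the resulting estimate of $F_N(u)$ need not be monotone.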

For our simulation study in Section 3, we suggest a bias 
correction to be used in our computation. If h(w) = g(w) = 
log(w), we suggest a corrected estimator of y as 


>*(&,B) = [exo(e Bu, +> oh Fy (2.14) 


if all u;, 1 = 1, ..., N are known, and replace Fy(“) by Fy (u) 
and i, in (2.6) by 7 ,when only x is known. This correction 
is motivated by model-based considerations under a normality 
assumption. Correspondingly, /,, of (2.7) is corrected as 


ie = {y*(α, β) : (α, β) ∈ C}. (2.15)
When other power transformations are used, similar correc- 
tions can be made using the results in Pankratz and Dudley 
(1987). 


3. APPLICATION TO SIX REAL 
POPULATIONS 


The six real populations studied by Royall and 
Cumberland (1981a, 1981b, 1985) are summarized in Table 1. 
Attention was given to the variety in the type of data 
(demographic, economic, etc.), and in the logical relationship
between the x and y variables, when these populations were 
chosen. Note that we have added 1 to the y values in 
population Cancer in order to take the log transformation. 


Figure 2. Histograms and scatter plots for the population Counties 70 before and after taking the log transformation


144 Chen and Chen: A Transformation Method for Finite Population Sampling Calibrated With Empirical Likelihood 


Table 1
Summaries of the Six Populations

Population    N     x̄            ȳ            ρ(x, y)   ρ(log(x), log(y))
Cancer        301   1.1288x10*   4.0847x10'   0.967     0.948
Cities        125   2.6602x10°   2.8553x10°   0.947     0.953
Counties60    304   8.9312x10?   3.2916x10*   0.998     0.998
Counties70    304   8.9312x10?   3.6984x10*   0.982     0.991
Hospitals     393   2.7470x10?   8.1465x10?   0.911     0.943
Sales         331   2.3164x10°   2.4078x10°   0.997     0.985


Table 2
Simulation results based on 10,000 simple random samples of size 32


The Counties 70 data are plotted in Figure 2. The histogram 
of y clearly indicates that the population distribution is 
severely skewed, while the same plot for log(y) shows a 
substantial improvement. Also, the scatter plot of log(y) vs. 
log(x) shows a better linear relationship than the scatter plot 
of y vs. x. The need for, and the benefit of, a transformation
are therefore clear. Similar comments can also be made for
populations Cities, Counties 60 and Hospitals. For populations
Cancer and Sales, the log transformation (or any other
power transformation) seems to weaken the linear relationship
that exists between x and y.

Now, we illustrate our new procedure by assuming 
h = g = log in (2.2). Equations (2.9) to (2.15) are used to 
perform the calculations. As in Royall and Cumberland 
(1981b, 1985), for each of the six populations, we take a
simple random sample of size 32, calculate $\bar{y}^{*}(\hat{\alpha}, \hat{\beta})$,
and construct the corresponding 95% confidence interval. We repeat this
process 10,000 times for each population. The results are 
reported in Table 2 under the title “Transformation Method” 
when all x values are known, and under the title “Empirical 
Likelihood Method” when only $\bar{X}$ is known. The term ratio
denotes the average length of the confidence intervals divided 
by the root mean square error for each population. The non- 
coverage rate (Ncr) is the proportion of intervals that fail 
to contain the population average y. The quantities under 
the titles “Regression Method (regression variance)” and 
“Regression Method (jackknife variance)” are obtained using 
the same method of Royall and Cumberland (1981b) when the 
usual regression variance and the jackknife variance of ¥ are 
used, respectively, but for 10,000 random samples instead of 
the original 1,000 samples. The results under “Empirical 
Likelihood Method (created population)” are to be explained 
in the next Section. 
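The two summary measures defined above, the ratio and the non-coverage rate (Ncr), can be computed from a set of simulated estimates and intervals as in the following sketch. It is illustrative only: the function name and argument layout are hypothetical, and the intervals are assumed to be stored as (lower, upper) pairs.

```python
import math


def summarize_intervals(estimates, intervals, true_mean):
    """Return (ratio, ncr) for repeated samples from one population.

    ratio: average interval length divided by the root mean square error
           of the point estimates.
    ncr:   proportion of intervals that fail to contain true_mean.
    """
    n_rep = len(estimates)
    avg_length = sum(upper - lower for lower, upper in intervals) / n_rep
    rmse = math.sqrt(sum((e - true_mean) ** 2 for e in estimates) / n_rep)
    misses = sum(1 for lower, upper in intervals
                 if not (lower <= true_mean <= upper))
    return avg_length / rmse, misses / n_rep
```

For an interval of the form estimate ± 1.96 standard errors with an accurate standard error, the ratio would be close to 2 × 1.96 = 3.92 and the Ncr close to 0.05, which gives a reference point for reading Table 2.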

Next, we follow Royall and Cumberland in making design-based
inference and studying the conditional coverage properties
of several interval estimation procedures. Specifically,
we divide the confidence intervals into 20 groups according
to the size of the sample mean of x, and plot the proportions of intervals in each
group that fail to contain the population average $\bar{y}$. For each
specific group, the proportion of those intervals that lie above
(below) $\bar{y}$ is plotted above (below) the horizontal line.
Figure 3 contains such plots for the Counties 70 data. The top 
two plots show the non-coverage rates of the regression 
method using the usual regression variance and the jackknife 


                 Cancer   Cities   Counties60   Counties70   Hospitals   Sales
Regression Method (regression variance)
  Ratio          3.26     3.65     3.05         2.90         3.62        2.94
  Ncr            0.141    0.116    0.146        0.271        0.098       0.176
Regression Method (jackknife variance)
  Ratio          4.03     3.88     4.03         3.57         3.93        3.95
  Ncr            0.081    0.102    0.083        0.192        0.068       0.079
Transformation Method (all x values are known)
  Ratio          5.08     4.00     B15)         3.76         4.04        5.41
  Ncr            0.018    0.074    0.053        0.069        0.042       0.001
Empirical Likelihood Method (only x̄ is known)
  Ratio          DyllP2   3.74     3.37         3.69         4.15        4.90
  Ncr            0.017    0.082    0.081        0.082        0.037       0.006
Empirical Likelihood Method (created population)
  Ratio          3.92     BioZ     3.97         3.96         3.90        3.99
  Ncr            0.057    0.059    0.055        0.058        0.059       0.059


variance for y; the middle two plots show the non-coverage 
rates of our new procedure. The bottom left plot will be 
explained in Section 4. As can be seen clearly, our new 
procedure with a log transformation produces substantial 
improvement. For populations Cities, Counties 60 and 
Hospitals, our new procedure also produces some improve- 
ment (plots are not shown here). For populations Cancer and 
Sales, the new procedure produces very conservative results. 
This is likely due to the fact that the log transformation (or 
any power transformation) actually weakens the linear 
relationship between x and y. 

We have also performed simulations for sample sizes 16 
and 64, and/or for target coverage rate 90%. The results are 
very similar to what we have presented. 


4. DISCUSSION 


We use the log transformation in some of our discussions 
because it is perhaps the most frequently used transformation 
in practice. Nevertheless, there exist more objective methods 
to select transformations. One such method is the well-known
Box-Cox power transformation which we have mentioned; 
see Box and Cox (1964), Box and Tidwell (1962), Carroll 
and Ruppert (1988). Another recent method is based on a 
procedure called alternating conditional expectation (ACE) 
(Breiman and Friedman 1985, De Veaux and Steele 1989). 

There are other ways to improve the conditional coverage
rate. One such possibility is to employ asymmetrical
error distributions such as the inverse Gaussian family
(Whitmore 1983). Another possibility is to adapt quasi-likelihood
(Nelder and Pregibon 1987) to finite population
problems.
Figure 3. Plots of conditional non-coverage rates for the population Counties 70 based on 10,000 simple random samples of size 32. Reference
lines are drawn at 2.5% and the expected non-coverage rate is 5% 


The validity of our new procedure is also demonstrated in 
the following simulation study. For each of the six real popu- 
lations, we create a new population by replacing the original 
$y_i$ values with

$$y_i = \exp\{\hat{\alpha} + \hat{\beta} \log(x_i) + \hat{\sigma} \epsilon_i\},$$

where $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\sigma}$ are the parameter estimates from fitting
model (2.2) with h = g = log to the old population, and the $\epsilon_i$ are
generated as i.i.d. standard normal variates. Using the six


created populations which are fixed, we repeat the simula- 
tions as in Section 3 for the case where only x is known. 
Table 2 contains the summary of this simulation study, and 
the non-coverage plot for the Counties 70 data is shown at the 
bottom left corner of Figure 3. (Non-coverage plots for other 
populations look very similar to this plot.) It is clear from this 
study that when the finite population is generated from a 
super-population model like (2.2) with a normal error distri- 
bution, our new procedure gives the correct conditional cover- 
age rates. Furthermore, we decrease the correlation between 
x and y to as low as 0.5 for each of the six populations by
increasing $\hat{\sigma}$ and repeat the above simulations. The results are
as good as those shown in Table 2 and Figure 3. 
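The construction of a created population can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the log-log form of model (2.2) with an additive error term scaled by the estimated standard deviation, and the function name, argument names, and seed are hypothetical.

```python
import math
import random


def create_population(x, alpha_hat, beta_hat, sigma_hat, seed=0):
    """Replace y_i by exp(alpha_hat + beta_hat*log(x_i) + sigma_hat*eps_i).

    The eps_i are i.i.d. standard normal draws. The additive form of the
    error term is an assumption about model (2.2) with h = g = log.
    """
    rng = random.Random(seed)  # fixed seed so the created population stays fixed
    return [
        math.exp(alpha_hat + beta_hat * math.log(xi) + sigma_hat * rng.gauss(0.0, 1.0))
        for xi in x
    ]
```

Increasing `sigma_hat` while holding the other two estimates fixed lowers the correlation between x and the created y, which is the device used above to reach correlations as low as 0.5.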

Although only the simple random sampling scheme is
considered in this paper, the proposed procedure is applicable
as long as (i) there is a linear correlation between h(y)
and g(x) for some monotone functions h and g, and (ii) either
$F_N(u)$ or $\hat{F}_N(u)$ can be found. Since the six populations
studied here were carefully chosen to be representative, our
new procedure is expected to be useful for studying other finite
populations.
APPENDIX


Proof of Theorem 2.1 (1). For any given real numbers $t_1$
and $t_2$, we have


t,(& - a) + t,(B - vee 


ep es + ———__—__ ee We. 
és om = i€s 


From Conditions 1, 2 and 3, we have 


Therefore, we can write 
t, (é a oy) a t,(B =a i. 7 


aa Yoox ol Sn = aye, +o(n Pye 


ies o ies 
u 


The Lindeberg- Hajek condition is satisfied for te, + 
1. - tala. (u;- %)e, under the moment condition 5, see 
Hajek (1960), Scott and Wu (1981) and Bickel and Freedman 
(1984). Together with Conditions 4, 6 and 7, the desired 
result follows by using the Cramér-Wold device. 


Proof of Theorem 2.1 (2). Because there may be other 
values (a’,B’) € B, for which y(a’,B’) = ¥(a,B) for some 
(a,B)¢€ BG, is always conservative. 
The Application of McNemar Tests to the Current
Population Survey’s Split Panel Study 


KATHERINE JENNY THOMPSON and ROBIN FISHER¹


ABSTRACT 


Results from the Current Population Survey split panel studies indicated a centralized computer-assisted telephone
interviewing (CATI) effect on labor force estimates. One hypothesis is that the CATI interviewing increased the probability
of respondents changing their reported labor force status. The two-sample McNemar test is appropriate for testing this type
of hypothesis: the hypothesis of interest is that the marginal changes in each of two independent samples’ tables are equal.
We show two adaptations of this test to complex survey data, along with applications from the Current Population Survey’s 
Parallel Survey split panel data and from the Current Population Survey’s CATI Phase-in data. 


KEY WORDS: Current Population Survey; Parallel survey; Nonparametric statistics. 


1. INTRODUCTION 


Results from the Current Population Survey’s Parallel 
Survey split panel study and from the Current Population 
Survey’s CATI Phase-in Project provided some indication of 
a centralized computer-assisted telephone interviewing 
(CATI) effect on the United States’ monthly labor force 
estimates (Thompson 1994 and Shoemaker 1993). One 
hypothesis is that the CATI interviewing increased the 
probability of respondents changing their reported labor
force status from the first (personal) interview to the second 
(CATI) interview. 

The two sample McNemar test is appropriate for testing 
this type of hypothesis. The McNemar test (1947) has been 
generalized to a two sample situation where the hypothesis of 
interest is that the marginal changes in each of two 
independent samples’ 2 x 2 tables are equal (Feuer and 
Kessler 1989). The application presented was for a two 
sample cohort analysis and assumed simple random sampling. 

Certain modifications of the test statistic for a McNemar 
test are necessary for a complex survey data application. First, 
because the data are not obtained through a simple random 
sample and are weighted, a separate estimate of the variance 
is required. Second, unless the survey has a longitudinal 
design, a separate link of individuals in two consecutive 
months of data must be performed. In general, such a link
will include some false matches and exclude some true 
matches. This adds another source of variance. 

We show two adaptations of this test to complex survey 
data. In particular, we present these tests along with 
applications to the Current Population Survey’s Parallel 
Survey split panel study and to the Current Population
Survey’s CATI Phase-in Project. In Section 2 we describe 
these test modifications including background on the one and 
two-sample McNemar tests (Section 2.1), modifications for 


complex survey data (Section 2.2), and some remarks on 
applications to several months’ data (Section 2.3). Section 3 
presents applications of these tests specifically to Current 
Population Survey Parallel Survey Data and to Current 
Population Survey CATI Phase-in data including background 
on the two studies (Section 3.1), details of the panel estimates 
and variance estimates (Section 3.2), diagnostics 
(Section 3.3), and results (Section 3.4). We make some 
concluding remarks in Section 4. Details of covariance 
estimation are included in the appendix. 


2. TEST AND MODIFICATIONS 


2.1 General 


A sample is randomly split into two independent 
representative samples (split panels). After a baseline 
measurement is taken in both panels, a new technique is 
administered in one panel, the treatment panel. The other 
panel serves as a control. 

The records are linked longitudinally after the second
measurement is taken. A matched response can be +, -, or * (missing).
Since these are matched data, the “**” cell will be empty.


This scenario is represented pictorially as 


                             Treatment Panel
                           Month 2 (Treatment)
                            +       -       *
Month 1              +     x++     x+-     x+*
(No Treatment)       -     x-+     x--     x-*
                     *     x*+     x*-              n

                              Control Panel
                          Month 2 (No Treatment)
                            +       -       *
Month 1              +     x'++    x'+-    x'+*
(No Treatment)       -     x'-+    x'--    x'-*
                     *     x'*+    x'*-             n'

¹ Katherine Jenny Thompson, Economic Statistical Methods and Programming Division, and Robin Fisher, Housing and Household Economic Statistics


where n is not necessarily equal to n’. 


For each panel, define 


$M_{(12)}$ as the set of cases which have month 1 and month 2
responses (matched cases). This set contains
$n_{(12)} = (x_{++} + x_{+-} + x_{-+} + x_{--})$ elements;


$M_{(10)}$ as the set of cases which have month 1 responses,
but no month 2 response. This set contains
$n_{(10)} = (x_{+*} + x_{-*})$ elements;


$M_{(02)}$ as the set of cases which have month 2 responses, but
no month 1 response. This set contains $n_{(02)} = (x_{*+} + x_{*-})$
elements.


Note that the n’s are sample sizes and do not have weights. 


First, consider the one-sample case. Traditionally, the one-sample
McNemar test statistic is constructed from the $n_{(12)}$ and
$n'_{(12)}$ matched responses, where a prime (′) indicates the
control panel. In the one-sample scenario, we test the
hypothesis


$H_0$: $p_{+-} = p_{-+}$, where the p’s refer to cell probabilities
$H_1$: Not $H_0$


i.e., the hypothesis that the movement from one state to the 
other (+ to -, or - to +) is zero. We also refer to this 
movement as the flux. 

The one-sample test can be a useful diagnostic in the two- 
sample situation. We examine the Control panel estimates to 
see if there is zero movement. Any significant movement in 
the Treatment panel can be measured as a deviation from zero 
flux or as a change in the probability of a “+.” 


The two-sample hypothesis is 


$H_0$: $(p_{+-} - p_{-+}) = (p'_{+-} - p'_{-+})$
$H_1$: Not $H_0$.


In other words, the difference in the probabilities of switching 
in the two directions is the same, regardless of the treatment, 
or equivalently, the difference in panel fluxes is zero. 

The Feuer and Kessler generalization (1989) to a two- 
sample McNemar test (described in 2.2.1 below) is confined 
to the $M_{(12)}$ and $M'_{(12)}$ linked sets. We can add an additional
assumption, however, to allow the unmatched responses to be 
included in computation of the test statistics. This assumption 
motivates the discussion in Section 2.2.2. 


2.2 Complex Survey Modifications 


2.2.1 Modification One: Longitudinally Linked Data 


This method is a straightforward application of the two- 
sample McNemar test, using longitudinally linked data from 
a complex survey. 

To construct the test statistic, we examine the cell
probabilities and note that

$$(p_{-+} - p_{+-}) = (p_{++} + p_{-+}) - (p_{++} + p_{+-}) = p_2^{o} - p_1^{o},$$

where $p_2^{o}$ is the marginal probability of a + response in month 2,
given a matched response for both months; and $p_1^{o}$ is the
marginal probability of a + response in month 1, given a
matched response for both months.

The one-sample test statistic constructed from this panel’s
data is

$$Z = \frac{p_2^{o} - p_1^{o}}{\sqrt{\mathrm{Var}(p_2^{o} - p_1^{o})}},$$

where

$$p_1^{o} = \frac{x_{++} + x_{+-}}{n_{(12)}}, \qquad p_2^{o} = \frac{x_{++} + x_{-+}}{n_{(12)}}.$$


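Given two independent panels, the two-sample test statistic is

$$Z = \frac{(p_2^{o} - p_1^{o}) - (p_2'^{\,o} - p_1'^{\,o})}{\sqrt{\mathrm{Var}(p_2^{o} - p_1^{o}) + \mathrm{Var}(p_2'^{\,o} - p_1'^{\,o})}},$$

where

$$p_1'^{\,o} = \frac{x'_{++} + x'_{+-}}{n'_{(12)}}, \qquad p_2'^{\,o} = \frac{x'_{++} + x'_{-+}}{n'_{(12)}}.$$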


These results hold regardless of sample design. To extend 
the results to a complex survey application, we use weighted 
estimates and complex survey variances and covariances in 
place of simple random sample variances. 

If the survey is designed to collect longitudinal data, then 
this modification is a natural extension of the method described 
by Feuer and Kessler. For this type of survey design, an 
effective mechanism to link individuals from month to month 
is presumably in place. Often, however, this is not the case, 
and one data set must be physically linked to another. Consequently,
the $n_{(12)}$ elements in the domain will contain some
false matches, and some actual matches may be inadvertently 
excluded. Both the record weights and variance estimates will 
need to be adjusted to account for the matching. Jabine and 
Scheuren (1986) provide an excellent summary of the method- 
ological issues arising from the use of linked data, both for 
model-based and ad-hoc (“hard’’) record linkage techniques. 
2.2.2 Modification Two: Unlinked Data


This method omits the longitudinal linkage step altogether, 
noting that the construction of the traditional McNemar test 
statistic can be expressed in terms of estimates of marginal 
probabilities. Assume that under the null hypothesis, the 
expected value of (p, - p_,) is zero. This is described for a
simple random sampling application in Marascuilo et al. 
(1988). 

The domain for the first month of data is given by
$M_{(12)} \cup M_{(10)}$, which contains $n_{(12)} + n_{(10)} = n_1$ elements. The
domain for the second month of data is given by $M_{(12)} \cup M_{(02)}$,
which contains $n_{(12)} + n_{(02)} = n_2$ elements.

The one-sample test statistic constructed from the unlinked
data is given by

$$Z = \frac{p_2 - p_1}{\sqrt{\mathrm{Var}(p_2 - p_1)}},$$

where

$$p_1 = \frac{x_{+\cdot}}{n_1}, \qquad p_2 = \frac{x_{\cdot+}}{n_2}.$$


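Given two independent panels, the two-sample test statistic is

$$Z = \frac{(p_2 - p_1) - (p'_2 - p'_1)}{\sqrt{\mathrm{Var}(p_2 - p_1) + \mathrm{Var}(p'_2 - p'_1)}},$$

where

$$p'_1 = \frac{x'_{+\cdot}}{n'_1}, \qquad p'_2 = \frac{x'_{\cdot+}}{n'_2}.$$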


As with the application described in 2.2.1, all estimates are 
weighted estimates, and variances are complex survey 
variances. 
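The one- and two-sample statistics share the same structure whether the marginal estimates come from linked or unlinked data, so a single pair of helper functions covers both modifications. The sketch below is illustrative only: the function names are hypothetical, the variance of each within-panel difference is assumed to be supplied by the caller (for example from a complex survey variance estimator), and the p-value assumes the statistic is approximately standard normal under the null hypothesis.

```python
import math


def one_sample_z(p1, p2, var_diff):
    """Z statistic for a single panel: (p2 - p1) / sqrt(Var(p2 - p1))."""
    return (p2 - p1) / math.sqrt(var_diff)


def two_sample_z(p1, p2, var_diff, p1_ctl, p2_ctl, var_diff_ctl):
    """Z statistic comparing the month-to-month change in two independent panels."""
    numerator = (p2 - p1) - (p2_ctl - p1_ctl)
    return numerator / math.sqrt(var_diff + var_diff_ctl)


def two_sided_p_value(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Because the panels are independent, the variance of the difference between the two panel changes is the sum of the two within-panel variances, which is why no cross-panel covariance term appears.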


2.3. Linear Combinations 


We can use our estimated covariance matrix to test linear 
combinations of A,, A,., and 6 over time, where A Pep Drs 
Ao =p, PP , and 6 =i, re Res and 2 Br p, and D> are 
vectors containing the marginal probabilities for the time 
period under consideration. 

General linear hypotheses of the form Kp are now easily 
tested. One might wish to test for contrast by time period, for 
example testing the average difference from January through 
June against the remainder of the year’s data. Perhaps the 
most interesting (to our applications) of these tests is of the 
hypothesis $H_0$: $\mathbf{1}'\mu = 0$, where $\mu$ is the expected value of one
of the vectors described above. 

Another test of particular interest is the “omnibus 
hypothesis,” where we test $H_0$: $\mu = 0$. The test statistics for
this hypothesis are A,” Liar ACLiio2#e and A; Lie As, 
each of which has an approximate chi-squared distribution 
with r degrees of freedom, where r is the dimension of the 
vector of interest. 
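Both kinds of test described in this section reduce to simple matrix arithmetic once a vector of differences and its estimated covariance matrix are available. The sketch below is illustrative and uses hypothetical names: `delta` stands for one of the vectors of differences and `cov` for its estimated covariance matrix, stored as a list of rows. It computes the quadratic-form statistic for the omnibus hypothesis and a Z statistic for the hypothesis that the components sum to zero.

```python
import math


def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def omnibus_statistic(delta, cov):
    """Return (delta' cov^{-1} delta, degrees of freedom)."""
    z = solve_linear(cov, delta)
    return sum(d * zi for d, zi in zip(delta, z)), len(delta)


def sum_test_z(delta, cov):
    """Z statistic for the hypothesis that the components of the mean sum to zero."""
    total = sum(delta)
    var_total = sum(sum(row) for row in cov)  # 1' cov 1
    return total / math.sqrt(var_total)
```

The omnibus statistic would be referred to a chi-squared distribution with the returned degrees of freedom; a statistics library can supply the tail probability, and the standard library alone is kept here so that the sketch stays self-contained.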
3. APPLICATIONS


In this section, we apply the one and two-sample 
McNemar techniques for unlinked data outlined in 2.2.2 and 
2.3 to two separate sets of data: the Current Population 
Survey’s Parallel Survey split panel data and Current 
Population Survey CATI Phase-in data. Tables 1 and 2 
(section 3.4.1) provide the results for Parallel Survey split 
panel data. Tables 3 and 4 (section 3.4.2) provide the results 
for the Current Population Survey CATI Phase-in data. 


3.1 Background 


The official monthly civilian labor force estimates from 
January 1994 onward are based on data from a compre- 
hensively redesigned Current Population Survey. The redesign 
included implementation of a new, fully computerized 
questionnaire, and an increase in centralized computer- 
assisted telephone interviewing (CATI). To gauge the effect 
of the Current Population Survey redesign on published 
estimates, a Parallel Survey was conducted using the new 
questionnaire and data collection procedures from July 1992 
through December 1993. Special studies were embedded in 
both the Parallel Survey and the Current Population Survey 
during the same time period to provide data for testing 
hypotheses about the effects of the new methodological 
differences on labor force estimates: the Parallel Survey split 
panel study and the Current Population Survey CATI Phase- 
in Project (a continuation of the study presented in 
Shoemaker 1993). 

The effect of increased centralized computer-assisted 
telephone interviewing was of particular interest. Findings 
from the study described in Shoemaker (1993) had shown that 
including centralized telephone interviews tended to yield a 
larger unemployment rate. The two-sample McNemar test 
appeared to be a good vehicle for examining this pheno- 
menon. In both the Current Population Survey and the 
Parallel Survey, households are interviewed for 4 consecutive 
months, not interviewed for the next 8 consecutive months, 
and then interviewed for another 4 consecutive months. The 
first and fifth interviews are conducted by a personal visit, 
and the subsequent interviews are conducted by telephone 
whenever possible. Thus the first and fifth interviews provide 
a baseline measurement of labor force status; the second and 
sixth interviews provide a “post-treatment” measurement of 
labor force status. 

To create the panels for both studies, sample within 
selected sample areas was randomly divided into two repre- 
sentative panels using systematic sampling methods. The 
treatment panel was designated as CATI eligible. This meant 
that the sample households in the panel were eligible for 
interview at a centralized facility after the initial (first and 
fifth) interviews. To be interviewed by CATI, a respondent 
must have a telephone and speak English or Spanish, and 
must agree to be interviewed in subsequent months by 
telephone. Not all households in this panel were interviewed 
by CATI. The other panel served as a control. 
The monthly unemployment rate is the primary statistic of
interest published from Current Population Survey data. This 
rate is defined as the estimated number of unemployed 
persons divided by the estimated number of persons in the 
civilian labor force (the denominator does not include military
personnel, persons under sixteen years old, people who are
no longer looking for work, or retired persons). Our primary
goal was to understand how including CATI interviews 
influenced the probability of changing labor force status, in 
this case from unemployed to not unemployed (or vice versa). 
Our statistics for the one and two-sample McNemar tests used 
unemployment to population ratios, rather than unem- 
ployment rates. This allowed for a slightly more precise 
estimate of the proportion by decreasing the variability of the 
test statistic. 


3.2 Estimates 


Each month/panel estimate is an unbiased estimate. That 
is, the weights used to produce the estimates were strictly a 
function of the probability of selection: each weight is the 
product of the baseweight (the inverse probability of selection 
for a PSU), the weighting control factor (an adjustment for 
field subsampling), and a split panel factor (an adjustment
for the probability of inclusion in a split panel). The split 
panel factor for the Parallel Survey study was constant by 
design: nine tenths of the sample was randomly assigned to 
the treatment panel. The split panel factors for the CPS CATI 
Phase-in were not constant: the sample in the treatment panel 
varied on a monthly level, as more sample was randomly 
assigned to CATI facilities. 

Variances of levels were computed with generalized 
variance functions (GVFs). For more details, see Fisher et al.
(1993). Robert Fay used his VPLX software (Fay 1990) to 
calculate replicate estimates of correlation between rotation 
groups for unemployed and for civilian labor force using 
September 1992 through December 1993 data from the 
Current Population Survey. We used these correlations for the 
test statistics based on unlinked data, assuming that they 
would not differ by survey (Current Population Survey versus 
Parallel Survey) or by geography (national versus sub- 
national). We derived an expression for the within-panel 
correlation for civilian population by relating previously 
calculated autocorrelations (Fisher and McGuinness 1993) 
and variance estimates to the individual rotation group 
estimates. See the appendix for details of the estimation of the 
correlations. 

We did not use the linked modification in our applications 
for several reasons. The primary reason was the difficulty of 
longitudinally matching the data. Moreover, we were unable
to evaluate the success of our matching. Finally, we did not 
have any estimates of correlation for the linked data. 

Implicit in our analysis of the unlinked data is the 
assumption that the probability of a nonresponse (or a non- 
match) is random. We assume that the probability of a 
nonresponse one month is independent of the respondent’s 


labor force classification in the previous month. This assump- 
tion is not universally recognized. In fact, Stasny and 
Fienberg (1984) argue the reverse, and propose several 
alternative discrete-time models for the use of longitudinally 
linked CPS data. In our application, the estimates of marginal 
probabilities based on our (perhaps) poorly matched linked 
data were almost identical to the estimates based on unlinked 
data, and so we feel that our analysis did not suffer 
particularly from our assumption. 


3.3 Diagnostics 


Small expected sample sizes in individual cells will result 
in highly variable and consequently unreliable tests. We are 
not aware of a general method of calculating adequate sample 
sizes for this type of analysis using complex survey data. 
Instead, as a naive approach we used a slightly modified 
version of the traditional Pearson chi-squared test diagnostic 
to form a cut-off value as follows: 


As defined in Section 2.2.2, let 


$x_{+\cdot}$ = unweighted unemployed persons in month 1;
$x_{-\cdot}$ = unweighted not-unemployed persons in month 1;
$x_{\cdot+}$ = unweighted unemployed persons in month 2;
$x_{\cdot-}$ = unweighted not-unemployed persons in month 2.


Recall that in the case of the usual contingency table, E[+-] =
$(x_{+\cdot} \times x_{\cdot-}) / n_{(12)}$ [E[-+] = $(x_{-\cdot} \times x_{\cdot+}) / n_{(12)}$] under the assumption of
independence (and ignoring missing values). In our estimates
of expected cell size, we used unlinked marginal data. The 
sample sizes for the two margins corresponding to the two 
months are different; that is, the denominators of the expected 
cell totals are different depending on which margin we 
examine. Because we could not observe $n_{(12)}$, we estimated it
by the geometric mean of n, and n,, which seemed to most 
closely resemble the expression for the expected cell size. We 
have not evaluated the effectiveness of the geometric mean 
versus alternative estimators. 

A commonly used rule in contingency table analysis is that 
expected cell sizes should be at least five. However, both the 
Current Population Survey and Parallel Survey designs are 
highly clustered, and we felt that the cut-off value should be 
adjusted upwards. Accordingly, we multiplied the cut-off 
value by a design effect. We further increased the cut-off 
value for expected cell sizes to compensate for the correlation 
between the rows and columns of our tables to arrive at our 
final cut-off expected cell size of ten. 
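The diagnostic described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: it assumes the cells checked are the two off-diagonal cells (+- and -+), uses the geometric mean of the two monthly sample sizes in place of the unobserved matched sample size, and applies the final cut-off of ten. The function and argument names are hypothetical.

```python
import math


def expected_cell_sizes(x_plus_m1, x_minus_m1, x_plus_m2, x_minus_m2):
    """Expected sizes of the +- and -+ cells under independence.

    Arguments are unweighted counts: unemployed (+) and not unemployed (-)
    in month 1 (m1) and in month 2 (m2).
    """
    n1 = x_plus_m1 + x_minus_m1
    n2 = x_plus_m2 + x_minus_m2
    n12 = math.sqrt(n1 * n2)  # geometric mean stands in for the matched sample size
    e_plus_minus = x_plus_m1 * x_minus_m2 / n12
    e_minus_plus = x_minus_m1 * x_plus_m2 / n12
    return e_plus_minus, e_minus_plus


def cells_adequate(x_plus_m1, x_minus_m1, x_plus_m2, x_minus_m2, cutoff=10.0):
    """True when both expected off-diagonal cell sizes reach the cut-off."""
    sizes = expected_cell_sizes(x_plus_m1, x_minus_m1, x_plus_m2, x_minus_m2)
    return min(sizes) >= cutoff
```

Pairs of months that fail this check would be left out of the analysis, as was done for several sets of adjacent months in the control panel.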


3.4 Results 


3.4.1 Parallel Survey Split Panel Study 


This section presents the formal results from the one and 
two-sample McNemar tests using unlinked Parallel Survey 
split panel data. Although these data were collected monthly,
small expected cell sizes in the control panel led us to omit 
several sets of adjacent months from this analysis. Table 1 
Table 1
One-Sample McNemar Tests for Individual Parallel
Survey Panels - Unlinked Data

Treatment Panel

Time Frame       p2 - p1     se(p2 - p1)     Z-Statistic   P-Value
10/92 - 11/92    -0.62       0.29            -2.18         0.03
11/92 - 12/92    -0.47       0.28            -1.68         0.09
04/93 - 05/93    -0.76       0.27            -2.84         0.00
06/93 - 07/93    -0.04       0.27            -0.16         0.88
08/93 - 09/93    -0.66       0.27            -2.42         0.02

Control Panel

Time Frame       p'2 - p'1   se(p'2 - p'1)   Z-Statistic   P-Value
10/92 - 11/92    2.44        0.81            3.02          0.00
11/92 - 12/92    0.11        0.83            0.14          0.89
04/93 - 05/93    0.20        0.72            0.27          0.78
06/93 - 07/93    0.97        0.71            1.38          0.17
08/93 - 09/93    =o          0.68            -2.54         0.01


provides summary statistics for the one-sample “monthly” 
tests for each panel which were based on unlinked data from 
the Parallel Survey’s split panels. Table 2 provides summary 
statistics for the two-sample tests based on unlinked data. 

The reported values of $p_1$, $p_2$, $p'_1$, and $p'_2$ are percentages
of estimated unemployed to estimated total population for the
panel. Recall that $p_1$ and $p'_1$ are the panel ratios of estimated
unemployed from the first and fifth interviews to the
estimated panel population from the first and fifth interviews;
$p_2$ and $p'_2$ are the panel ratios of estimated unemployed from
the second and sixth interviews to the estimated panel
population from the second and sixth interviews. Data from
the time frame of February 1993 — March 1993 are omitted: 
a CATI facility was closed during the March interview week 
because of a blizzard. 

The one-sample McNemar tests in Table 1 test the 
hypothesis that the proportion unemployed does not change 
between the initial and the subsequent interview within the 
same panel. We use the Control panel to examine the 
unemployment flux from one month to the next in the absence 
of CATI. Note that the two significant point estimates are in 
opposite directions. 
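
As a rough illustration of how the Z-statistics and P-values in Table 1 relate, the sketch below converts a Z-statistic into a two-sided normal P-value. The function name `two_sided_p_value` is ours, and reading the reported P-values as two-sided normal tail probabilities is an assumption inferred from the tabled values, not something the text states.

```python
import math


def two_sided_p_value(z):
    """Two-sided P-value for a standard normal test statistic z."""
    # For Z ~ N(0, 1), P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z-statistics taken from four Treatment panel rows of Table 1.
for z in (-2.18, -1.68, -2.84, -2.42):
    print(z, round(two_sided_p_value(z), 2))
```

Rounded to two decimals, these values can be compared directly with the P-Value column for the same rows.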

The entire vector of differences of proportions was 
found to be significantly different from the zero vector 
(p-value = 0.00), but the sum of the individual components 
was not found to be significant (p-value = 0.24). Conse- 
quently, we did not test any further linear combinations. 

We expected a certain amount of month-in-sample bias to 
be present in these estimates. In Adams (Bureau of the 
Census 1991), the estimates of $p_3$ constructed from the first 
and fifth months in sample of the full Current Population 
Survey were roughly six percent larger than their respective 
second and sixth month-in-sample analogues ($p_4$). Conse- 
quently, estimates of ($p_4 - p_3$) calculated from the full Current 
Population Survey data were generally negative. As seen in 
Table 1, this was not the case with the Parallel Survey Control 
panel's estimates: counter to our intuition, the estimated 
difference ($p_4 - p_3$) is generally positive. Adams used 1987 
data from the Current Population Survey to calculate national 
estimates of biases associated with rotation groups, so this 
discrepancy could be a function of the time difference, a 
geographic difference, or a design difference. Thus in each of 
these one-sample tests, the net movements are intertwined with 
an unmeasured effect from month-in-sample bias. 

Note the negative unemployment flux in the Treatment 
panel. This observation is supported by the significant 
result from the formal test of the omnibus hypothesis 
(p-value = 0.00), and the significant result for the hypothesis 
1’p = 0 (p-value = 0.00). 
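
For readers who want to see the mechanics of the omnibus test and the test of 1'p = 0, the following is a minimal sketch. It assumes the monthly differences are uncorrelated, so that the covariance matrix is diagonal; the text does not say which covariance estimate was used, so this is an illustration only and not the authors' computation. The helper names `chi2_sf` and `omnibus_and_sum_tests` are ours.

```python
import math


def chi2_sf(stat, df):
    """Upper tail probability of a chi-square variable with integer df."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)               # Q(1, x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))    # Q(1/2, x)
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def omnibus_and_sum_tests(diffs, ses):
    """Wald tests of p = 0 and of 1'p = 0 under a diagonal covariance."""
    wald = sum((d / s) ** 2 for d, s in zip(diffs, ses))
    p_omnibus = chi2_sf(wald, len(diffs))
    z_sum = sum(diffs) / math.sqrt(sum(s ** 2 for s in ses))
    p_sum = math.erfc(abs(z_sum) / math.sqrt(2.0))
    return p_omnibus, p_sum


# Treatment panel differences and standard errors from Table 1.
diffs = [-0.62, -0.47, -0.76, -0.04, -0.66]
ses = [0.29, 0.28, 0.27, 0.27, 0.27]
p_omnibus, p_sum = omnibus_and_sum_tests(diffs, ses)
print(round(p_omnibus, 2), round(p_sum, 2))
```

With a full covariance matrix the Wald statistic would be the quadratic form in the inverse covariance instead of the sum of squared Z-statistics used here.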


The two-sample McNemar test results are presented below. 


Table 2 
Two-Sample McNemar Tests - Unlinked Parallel Survey Data 

Time Frame       $(p_2 - p_1) - (p_4 - p_3)$   se[$(p_2 - p_1) - (p_4 - p_3)$]   Z-Statistic   P-Value 
10/92 - 11/92            -3.06                         0.86                  [illegible]     0.00 
11/92 - 12/92            -0.58                         0.88                     -0.66        0.51 
04/93 - 05/93         [illegible]                      0.77                     -1.24        0.22 
06/93 - 07/93            -1.02                         0.76                     -1.34        0.18 
08/93 - 09/93             1.08                         0.74                      1.47        0.14 


Individually, the monthly results do not demonstrate a 
clear difference in the unemployment flux between the two 
panels. On the other hand, the omnibus test statistic is 
significant (p-value = 0.00). The mean unemployment flux 
seems to be lower in the treatment panel, as evidenced by the 
significant test result for the hypothesis 1'p = 0 
(p-value = 0.01), where p is the vector of the differences 
$(p_2 - p_1) - (p_4 - p_3)$, with each element corresponding to 
a month's estimate. 

In these tests, we make statements about contrasts in a 
table of probabilities, looking for indicators of the effect of a 
treatment on unemployment movement. As mentioned earlier, 
some month-in-sample bias is present in the one-sample tests. 
The tested hypotheses examine combinations of the net 
movement within a panel and month-in-sample bias. This 
problem is somewhat mitigated in the two-sample tests. 
Indeed, if month-in-sample bias is an additive term which 
affects both panels equally, it will cancel out of the test 
statistic. Moreover, this effect will be alleviated somewhat in 
the two-sample test even if it is not the same between the two 
panels or is multiplicative. Our preliminary sensitivity 
analysis bore this out: we found that the one-sample tests 
were sensitive to month-in-sample bias, but that the two- 
sample tests were not. 
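
The cancellation argument can be checked numerically. The sketch below forms the two-sample contrast from two one-sample differences and shows that an additive bias common to both panels leaves the contrast unchanged. Treating the two panels as independent, so that the standard errors combine in quadrature, is our assumption; the text does not state it. The function name, variable names, and the bias value are hypothetical.

```python
import math


def two_sample_contrast(d_treat, se_treat, d_ctrl, se_ctrl):
    """Return (contrast, se, z) for the treatment-minus-control difference.

    Assumes the two panels are independent (an assumption of this sketch).
    """
    contrast = d_treat - d_ctrl
    se = math.sqrt(se_treat ** 2 + se_ctrl ** 2)
    return contrast, se, contrast / se


# 10/92 to 11/92 row of Table 1: Treatment (-0.62, 0.29), Control (2.44, 0.81).
contrast, se, z = two_sample_contrast(-0.62, 0.29, 2.44, 0.81)
print(round(contrast, 2), round(se, 2))

# An additive month-in-sample bias b shared by both panels cancels out.
b = 0.30  # hypothetical bias, in percentage points
biased, _, _ = two_sample_contrast(-0.62 + b, 0.29, 2.44 + b, 0.81)
print(math.isclose(biased, contrast))
```

A multiplicative bias, or one that differs between panels, would not cancel exactly, which is why the text claims only that the effect is alleviated in that case.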

The two-sample t-tests presented in Thompson (1994) 
failed to detect a difference by panel in mean unemployment 
rate using the Parallel Survey split panel data. This contrasts 
with the Current Population Survey CATI Phase-in results: 
over two years, the CATI (Treatment) panel had consistently 
significantly higher unemployment rates than the non-CATI 
(Control) panel. See Shoemaker (1993). In this analysis of 
Parallel Survey split panel data, we have evidence that the 
expected value of the proportion unemployed is lower in the 
presence of CATI. There are, however, some problems with 
the data. First, as previously mentioned, there is some 
confounding in the Treatment (CATI) panel, since not all 
respondents in this panel have their second interview 
conducted from a centralized telephone facility. Second, in 
each month the expected sample size in the Control panel 
cells was near ten, which could be small enough to make the 
distribution behave unpredictably. This latter problem is not 
an issue with the Current Population Survey CATI Phase-in 
study analysis presented in 3.4.2. 


3.4.2 Current Population Survey CATI Phase-in 
Project Results 


The Current Population Survey CATI Phase-in project was 
a continuation of the study presented in Shoemaker (1993). 
The primary purpose of this study was to measure the effect 
of including CATI interviewing on the unemployment rate. 
CATI interviewers in this study used an automated version of 
the old Current Population Survey paper questionnaire, which 
had a slightly modified version of the lead-in labor force 
question. More details are provided in Thompson (1994). The 
data considered in this paper are from the same time period as 
the Parallel Survey split panel data examined in 3.4.1: 
October 1992 through December 1993, again omitting the 
February 1993 — March 1993 time frame. Expected cell sizes 
in both the Treatment (CATI) and Control (non-CATI) panels 
were well over one hundred, and so all other contiguous 
months of data are included. 

The one-sample McNemar test results for both panels are 
presented in Table 3. Test statistics are constructed with 
unlinked data. The reported values of $p_1$, $p_2$, $p_3$, and $p_4$ are 
percentages of estimated unemployed to estimated total 
population for the panel. 

As with the Parallel Survey split panel data, the one- 
sample McNemar tests using the CATI Phase-in data test the 
hypothesis that the proportion unemployed does not change 
between the initial and the subsequent interview within the 
same panel. Again, we use the Control panel to estimate the 
unemployment flux from one month to the next in the absence 
of CATI. The monthly tests for the Control panel do not 
appear to exhibit any particular movement. Furthermore, the 
omnibus hypothesis test was not significant (p-value = 0.29), 
so we did not test any further linear combinations. 

Again basing our expectations on the effects of month-in- 
sample bias presented in Adams (1991), we believed that the 
Control panel estimate of $p_3$ (from the first and fifth months- 
in-sample) would be larger than its respective second and 
sixth month-in-sample analog, $p_4$. On the average, this was 
the case: although quite variable, the estimates of $p_3$ are on 
the average about 4 percent larger than the estimates of $p_4$. 
Because both panels are representative samples from the same 
parent sample, we assume that the month-in-sample bias 
behaves similarly in both panels. The Treatment (CATI) panel 
estimates of $p_2$ are larger on the average than the estimates of 
$p_1$. Given the behavior of the Control panel's estimates, this 
phenomenon provides some evidence of a CATI effect. 


Table 3 
One-Sample McNemar Tests for Individual Current 
Population Survey Panels - Unlinked Data 

Treatment Panel 

Time Frame       $p_2 - p_1$   se($p_2 - p_1$)   Z-Statistic   P-Value 
10/92 - 11/92       1.13          0.16             7.63        0.00 
11/92 - 12/92       0.07          0.17             0.44        0.66 
12/92 - 01/93       0.43          0.13             3.46        0.00 
01/93 - 02/93       0.00          0.14             0.03        0.97 
03/93 - 04/93      -0.25          0.14            -1.81        0.07 
04/93 - 05/93       0.63          0.13             4.99        0.00 
05/93 - 06/93       0.88          0.13             6.56        0.00 
06/93 - 07/93       0.84          0.13             6.49        0.00 
07/93 - 08/93      -0.07          0.14         [illegible]     0.61 
08/93 - 09/93       0.42          0.13         [illegible]     0.00 
09/93 - 10/93       0.06          0.12         [illegible]     0.60 
10/93 - 11/93       1.05          0.12             8.45        0.00 
11/93 - 12/93       0.18          0.14         [illegible]     0.20 

Control Panel 

Time Frame       $p_4 - p_3$   se($p_4 - p_3$)   Z-Statistic   P-Value 
10/92 - 11/92       0.05          0.47             0.11        0.92 
11/92 - 12/92      -0.14          0.47            -0.30        0.76 
12/92 - 01/93       0.72          0.43             1.68        0.09 
01/93 - 02/93      -0.91          0.43            -2.11        0.03 
03/93 - 04/93      -0.16          0.39            -0.40        0.69 
04/93 - 05/93      -0.18          0.43            -0.42        0.67 
05/93 - 06/93       0.47          0.38             1.22        0.22 
06/93 - 07/93      -0.32          0.46            -0.68        0.49 
07/93 - 08/93    [illegible]      0.39         [illegible]     0.19 
08/93 - 09/93      -0.54          0.44         [illegible]     0.23 
09/93 - 10/93      -0.08          0.37            -0.22        0.83 
10/93 - 11/93      -0.63          0.42            -1.50        0.13 
11/93 - 12/93      -0.09          0.37            -0.23        0.82 

Note the movement in the Treatment panel from not 
unemployed to unemployed. This observation is supported by 
the significant result from the formal test of the omnibus 
hypothesis (p-value = 0.00), and the significant result for the 
hypothesis 1'p = 0 (p-value = 0.00). In contrast to the Parallel 
Survey results provided in 3.4.1, this data provides some 
evidence that the unemployment rate is higher in the presence 
of CATI. This evidence is further supported by the two-sample 
McNemar test results provided in Table 4. The individual 
monthly results in Table 4 provide some evidence of a 
difference in the unemployment flux between the two panels. 
Furthermore, the omnibus test is significant (p-value = 0.00). 
The mean unemployment flux in the Treatment panel seems 
to be higher, as evidenced by the significant test result for the 
hypothesis 1'p = 0. 

The two-sample t-tests presented in Thompson (1994) also 
detected a positive difference by panel in mean unemploy- 
ment rate using the Current Population Survey split panel data, 
i.e., including CATI interviews resulted in a higher unem- 
ployment rate. These results were consistent with the Current 
Population Survey CATI Phase-in results presented in 
Shoemaker (1993). This analysis of Current Population Survey 
split panel data reinforces that conclusion. Again, it is 
impossible to attribute the positive net migration from not 
unemployed to unemployed entirely to the effect of CATI: the 
same confounding described in 3.4.1 is present in this 
Treatment (CATI) panel. 


Table 4 
Two-Sample McNemar Tests - Unlinked Current 
Population Survey Data 

Time Frame       $(p_2 - p_1) - (p_4 - p_3)$   se[$(p_2 - p_1) - (p_4 - p_3)$]   Z-Statistic   P-Value 
10/92 - 11/92             1.18                         0.50                      2.38        0.02 
11/92 - 12/92             0.22                         0.50                      0.43        0.67 
12/92 - 01/93         [illegible]                      0.45                     -0.64        0.52 
01/93 - 02/93             0.92                         0.45                      2.03        0.04 
03/93 - 04/93            -0.10                         0.42                     -0.23        0.81 
04/93 - 05/93             0.81                         0.45                      1.81        0.07 
05/93 - 06/93             0.41                         0.41                      1.01        0.31 
06/93 - 07/93             1.16                         0.48                      2.41        0.02 
07/93 - 08/93             0.45                         0.42                      1.07        0.28 
08/93 - 09/93             0.95                         0.46                      2.06        0.04 
09/93 - 10/93             0.14                         0.39                      0.37        0.71 
10/93 - 11/93             1.69                         0.44                      3.83        0.00 
11/93 - 12/93             0.26                         0.40                      0.66        0.51 


3.5 Discussion 


Our results appear to yield opposite conclusions about the 
effect of CATI on unemployment flux. The CATI effect is 
not, however, the same in both tests. 

Perhaps the key difference is the questionnaire. The 
Parallel Survey data was collected using the newly redesigned 
Current Population Survey questionnaire. The new question- 
naire was designed as an automated instrument. In contrast, 
the old Current Population Survey questionnaire used for the 
Current Population Survey CATI Phase-in Project was 
designed as a paper instrument. Field interviewers were 
required to memorize complicated skip patterns. To minimize 
respondent burden, both versions of the Current Population 
Survey questionnaire are designed for an average interview 
length of twenty minutes. Using an automated questionnaire, 
an interviewer can collect more (and more detailed) 
information in the same amount of time, since she no longer 
has to determine the path of the interview. Besides the 
automation difference, the wording of the labor force 
questions differs between the two questionnaires. 

Parallel Survey interviews were conducted using the same 
questionnaire both in the field interviews (using a laptop 
computer) and in the CATI facilities. In contrast, the Current 
Population Survey CATI Phase-in interviews used two 
different versions of the old questionnaire: a paper version 
for the field interviews; and an automated version, with a 
slightly modified lead-in labor force question for the CATI 
interviews. 

Given these questionnaire differences, and the caveats 
about the Parallel Survey split panel data, we view our results 
as preliminary. We recommend pursuing this 
examination using one and two-sample McNemar techniques 
on the new Current Population Survey split panel data, which 
uses the old CATI Phase-in design and the redesigned, fully 
automated questionnaire. 


4. CONCLUSION 


We have presented two modifications of the one and two- 
sample McNemar tests using complex survey data, with 
applications from the unlinked data modification. If the 
survey does not have a longitudinal design, then the applica- 
tion using the linked data will have an unknown variance/ 
covariance structure and will include a variance component 
due to matching error. In this case, using the unlinked data 
makes sense with respect to the model’s interpretation, 
although the statistic based on the (unlinked) estimates of 
marginal probabilities may be inferior to a well-developed 
linked model. If the survey has a longitudinal design, then the 
first method may be preferred, as it is a straightforward 
extension of the traditional test, and consequently, the 
interpretation is equivalent to the textbook interpretation. 

The two-sample McNemar test is not the sole approach 
one might use in the situation described in section 2.2.2. 
Another approach to the unlinked form of this problem would 
be to use a log-linear model for a 2 x 2 x 2 contingency table 
as in Rao and Scott (1984). In either case, there are trade-offs. 
The interpretation of the McNemar test is intuitive: it is a 
cause and effect model, or a repeated measures type of 
experimental design. The 2 x 2 x 2 contingency table model’s 
interpretation is perhaps less intuitive. Note, however, that the 
test statistics for the McNemar tests are "Wald-like" statistics, 
which are often considered to be less efficient than the chi- 
squared type, e.g., Fay (1985). It is also worth noting that 
unlike the Rao-Scott formulation, the approach described in 
this paper makes explicit provisions for the use of linked data. 

Areas for future research include investigations into the 
power of these tests in the context of complex sample data, 
variance/covariance estimation for linked data including 
matching error variance contributions, and the difference in 
efficiency in the two approaches. In data analytical applica- 
tions, one and two-sample McNemar tests seem to have uses 
in comparing aspects of different survey methods or effects 
on responses within a method over time. The approach is 
nonparametric in its conception; when the approximation is 
good, it avoids pitfalls that may be associated with model- 
based tests. 


APPENDIX 


For the unlinked data modification of the McNemar Test, 
($p_2 - p_1$) is estimated by $X_2/N_2 - X_1/N_1$, where $X_1$, $X_2$, $N_1$, 
and $N_2$ are weighted estimates, and 


Ex X, || Var(X,.)  Var(W,) 
Var(p, - p,) =|—=| |} = - ——* 
N Neg N, 


xX ,|'|Var(X,.)  Var(N,) 


25 


2 Me N; 


“ 


2X, X,[Cov(X, X.)  Var(N,) 
NTN S32 Nn? 


Var(N,)  se(N,) se(N,) 
SS 
N; N,N, 


In this appendix we discuss the derivation of the 
covariance term in the variance estimate, considering only the 
unlinked data. 


Consider the within-panel covariance 

$$\mathrm{Cov}(X_{1\cdot}, X_{2\cdot}) = \sum_{j=1,5}\ \sum_{k=2,6} \mathrm{Cov}(X_{1,j}, X_{2,k}) \qquad (A1)$$

where $X_{i,j}$ is a weighted sample level for month i, month-in- 
sample (MIS) j. Note that $X_{i,j}$ and $X_{i+1,j+1}$ are from the same 
rotation group unless j = 4, since a rotation group is out of 
sample for eight months after being in for four. 

We assumed that the correlations between $X_{i,j}$ and $X_{i',k}$ can 
be decomposed into three separate categories: 

1) A within-rotation-group correlation, 
   $\mathrm{Corr}(X_{i,j}, X_{i+1,j+1}) = r_1$, when j = 1, 2, 3, 5, 6, 7. 

2) A within-month-between-rotation-group correlation, 
   $\mathrm{Corr}(X_{i,j}, X_{i,k}) = r_2$, $k \ne j$, and 

3) A between-rotation-group between-month correlation, 
   $\mathrm{Corr}(X_{i,j}, X_{i+1,k}) = r_3$, $k \ne j+1$ or $j = 4$. 

Replicate estimates of these correlations were available. 

The covariance in (A1) becomes 

$$\begin{aligned}
\mathrm{Cov}(X_{1\cdot}, X_{2\cdot}) &= \mathrm{Cov}(X_{1,1} + X_{1,5},\ X_{2,2} + X_{2,6}) \\
&= \mathrm{Cov}(X_{1,1}, X_{2,2}) + \mathrm{Cov}(X_{1,1}, X_{2,6}) + \mathrm{Cov}(X_{1,5}, X_{2,2}) + \mathrm{Cov}(X_{1,5}, X_{2,6}) \\
&= 2(r_1 + r_3)\,\mathrm{Var}(X_{i,j}), \qquad (A2)
\end{aligned}$$

using the simplifying assumption that $\mathrm{Var}(X_{i,j})$ is constant for 
all i and j. The variance for a full month's estimate, 
$\mathrm{Var}(\sum_{j=1}^{8} X_{i,j})$, is available in the form of a generalized 
variance function (GVF). We use this estimate to calculate 
$\mathrm{Var}(X_{i,j})$ by applying the following derivation: 

$$\mathrm{Var}\Big(\sum_{j=1}^{8} X_{i,j}\Big) = \sum_{j=1}^{8} \mathrm{Var}(X_{i,j}) + \sum_{j \ne k} \mathrm{Cov}(X_{i,j}, X_{i,k}) = (8 + 56\,r_2)\,\mathrm{Var}(X_{i,j})$$

$$\mathrm{Var}(X_{i,j}) = (8 + 56\,r_2)^{-1}\,\mathrm{Var}\Big(\sum_{j=1}^{8} X_{i,j}\Big) \qquad (A3)$$
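
The chain from a GVF-based full-month variance to the covariance term can be sketched as follows. This is a reading of (A2) and (A3), not the authors' code: the labels `r1`, `r2`, `r3` for the three correlations, the factor 8 + 56·r2 (eight rotation groups and 56 ordered pairs of distinct groups), and the numerical inputs are all assumptions made for illustration.

```python
def rotation_group_variance(var_full_month, r2, n_groups=8):
    """Var(X_ij) implied by (A3), given the full-month variance."""
    n_pairs = n_groups * (n_groups - 1)   # 56 ordered pairs when n_groups = 8
    return var_full_month / (n_groups + n_pairs * r2)


def within_panel_covariance(var_full_month, r1, r2, r3):
    """Cov(X_1., X_2.) from (A2): 2 * (r1 + r3) * Var(X_ij)."""
    return 2.0 * (r1 + r3) * rotation_group_variance(var_full_month, r2)


# Hypothetical inputs, for illustration only.
print(within_panel_covariance(var_full_month=15.0, r1=0.5, r2=0.125, r3=0.1))
```

In practice `var_full_month` would come from the published GVF and the correlations from the replicate estimates mentioned above.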


Stability Measures for Variance Component Estimators 
Under a Stratified Multistage Design 


J.L. ELTINGE and D.S. JANG 


ABSTRACT 


In work with sample surveys, we often use estimators of the variance components associated with sampling within and 
between primary sample units. For these applications, it can be important to have some indication of whether the variance 
component estimators are stable, i.e., have relatively low variance. This paper discusses several data-based measures of the 
stability of design-based variance component estimators and related quantities. The development emphasizes methods that 
can be applied to surveys with moderate or large numbers of strata and small numbers of primary sample units per stratum. 
We direct principal attention toward the design variance of a within-PSU variance estimator, and two related 
degrees-of-freedom terms. A simulation-based method allows one to assess whether an observed stability measure is 
consistent with standard assumptions regarding variance estimator stability. We also develop two sets of stability measures 
for design-based estimators of between-PSU variance components and the ratio of the overall variance to the within-PSU 
variance. The proposed methods are applied to interview and examination data from the U.S. Third National Health and 
Nutrition Examination Survey (NHANES III). These results indicate that the true stability properties may vary substantially 
across variables. In addition, for some variables, within-PSU variance estimators appear to be considerably less stable than 
one would anticipate from a simple count of secondary units within each stratum. 


KEY WORDS: Between-PSU variance; Complex sample design; Degrees of freedom; Diagnostic; Design-based analysis; 
Satterthwaite approximation; Stratum collapse; U.S. Third National Health and Nutrition Examination 
Survey (NHANES III); Within-PSU variance. 


1. INTRODUCTION 


In work with sample surveys, it is often desirable to have 
good estimates of the variance components attributable to 
sampling within and between primary sample units (PSUs). 
For example, the magnitude of an estimated within-PSU 
variance, relative to a between-PSU variance, may influence 
decisions regarding sample allocation and related design 
issues (e.g., Hansen et al. 1953, Chapter 7). Similar relative- 
magnitude properties affect the bias of certain variance esti- 
mators derived under simplifying assumptions regarding the 
sample design (e.g., Korn and Graubard 1995, p. 278-279, 287; 
and Wolter 1985, p. 44-46). Also, some survey analysts have 
a general interest in identification of surveys and variables for 
which the between-PSU component of variance is substantially 
greater than zero. See, e.g., Herzog and Scheuren (1976, p. 398) 
and Wolter (1985, p. 47) for related comments. In addition, 
Jang and Eltinge (1996) give an example for which there is 
some interest in the within-PSU variances by themselves. 

In some application work, estimates of within-PSU 
variances and related quantities are reported with the apparent 
assumption that the estimates are stable, i.e., have relatively 
low variances. This paper shows that it can be important to 
carry out data-based checks of this assumption of stability, 
and that some relatively simple checking methods follow from 
standard design-based ideas. We emphasize methods that can 
be applied to designs with a moderate or large number of 
strata and a small number of PSUs selected per stratum. 


Subsection 2.1 reviews the relevant estimators of within- 
PSU variances and overall stratum-level variances. Sub- 
section 2.2 identifies two distinct components of the variance 
of the within-PSU variance estimator. Subsection 2.3 presents 
simple design-based estimators of the variances of two within- 
PSU variance estimators. Section 3 develops two related 
degrees-of-freedom measures. 

Section 4 examines the extent to which related design- 
based methods can be used to assess the stability of quantities 
that depend both on the within-PSU variance estimator and on 
the overall stratum-level variance estimator. Principal atten- 
tion is directed toward an estimator of the between-PSU 
variance and an estimator of the ratio of the overall stratum- 
level variance divided by the within-PSU variance. Section 4.2 
discusses one set of methods based on the stability measures 
from Section 2 and some moderately restrictive moment 
assumptions. Section 4.3 outlines a second set of methods 
based on stratum collapse. 

Section 5 applies the main ideas of Sections 2 through 4 to 
variance estimates computed for the U.S. Third National 
Health and Nutrition Examination Survey. Section 5 also uses 
a simple simulation-based method to assess the consistency of 
the observed measures with standard assumptions regarding 
variance estimator stability. The Section 5 results suggest that 
the true stability of within-PSU variance estimators can be 
substantially less than anticipated from a simple count of the 
number of secondary units contributing to each PSU. In 
addition, the results indicate that the stability properties of 
within-PSU variance estimators and related quantities can 
vary substantially across different variables collected in the 
same survey. Section 6 gives additional comments on the 
methods and empirical results presented here. 


2. WITHIN-PSU AND OVERALL 
STRATUM-LEVEL 
VARIANCE ESTIMATORS 


2.1 General Notation 


In principle, we could use either design-based or model- 
based methods to examine within-PSU and between-PSU 
variance components. The present work will take a design- 
based approach. This is consistent with some related previous 
literature, e.g., Wolter (1985, p. 40-41, 47). The design-based 
approach will be especially useful in highlighting some 
strengths and limitations of the proposed stability-assessment 
methods. For example, in Section 2.3 this approach will give 
us some indication of specific design features that may affect 
variance estimator stability. Also, in Section 4 the design- 
based approach will help to clarify the extent to which certain 
moment restrictions are needed to justify one set of stability 
measures. 

Following the notation and ideas in Wolter (1985, 
p. 43-47), consider a stratified multistage sample design with 
L strata and with $N_h$ primary sampling units (PSUs) contained 
in stratum h = 1, 2, ..., L. We select $n_h$ PSUs with replacement 
and with per-draw selection probabilities $p_{hi}$. Within selected 
PSU (h, i), we select $n_{hi}$ secondary sample units (SSUs) with 
replacement and with per-draw selection probabilities $p_{hij}$. 
Further subsampling is carried out within a selected SSU to 
obtain $n_{hij}$ individual elements for interview or examination. 
The stability-assessment methods developed here are intended 
primarily for designs with moderate or large L, relatively 
small $n_h$ (e.g., $n_h = 2$), and relatively large $n_{hi}$. Designs with 
these characteristics are often used in large household inter- 
view surveys, e.g., the health survey discussed in Section 5. 

We will focus on estimation of a population total 

$$Y = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{N_{hi}} \sum_{k=1}^{N_{hij}} Y_{hijk},$$

where $Y_{hijk}$ is a survey item for element k in SSU j in PSU i in 
stratum h, $N_{hi}$ is the number of SSUs in PSU (h, i), and $N_{hij}$ is 
the number of elements in SSU (h, i, j). Extensions to nonlinear 
functions of population totals are straightforward and will be 
considered in the applications in Section 5. A standard 
design-based estimator of Y is $\hat{Y} = \sum_{h=1}^{L} \hat{Y}_h$, where 

$$\hat{Y}_h = \sum_{i=1}^{n_h} \sum_{j=1}^{n_{hi}} \sum_{k=1}^{n_{hij}} w_{hijk}\, y_{hijk}, \qquad (2.1)$$

$w_{hijk}$ is the customary weight derived from selection proba- 
bilities and sample sizes to ensure unbiased estimation of each 
$Y_h$, and the lower-case terms $y_{hijk}$ denote sample observations. 
In subsequent work, it will be useful to rewrite expression 
(2.1) as 

$$\hat{Y}_h = n_h^{-1} \sum_{i=1}^{n_h} p_{hi}^{-1}\, \hat{Y}_{hi},$$

where $\hat{Y}_{hi} = n_{hi}^{-1} \sum_{j=1}^{n_{hi}} z_{hij}$ and $z_{hij} = n_h n_{hi} p_{hi} \sum_{k=1}^{n_{hij}} w_{hijk}\, y_{hijk}$. 
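
To make the equivalence between the weighted sum (2.1) and its rewritten form concrete, here is a small sketch that computes the stratum estimate both ways. The data layout (a list of PSUs, each with a per-draw probability `p` and a list of SSUs holding `(weight, value)` pairs), the function names, and the numbers are hypothetical.

```python
def stratum_total_direct(psus):
    """Stratum estimate as the plain weighted sum in (2.1)."""
    return sum(w * y for psu in psus for ssu in psu["ssus"] for (w, y) in ssu)


def stratum_total_via_z(psus):
    """Stratum estimate as n_h^{-1} * sum_i Y_hi-hat / p_hi,
    where Y_hi-hat is the mean of the terms z_hij."""
    n_h = len(psus)
    total = 0.0
    for psu in psus:
        n_hi, p_hi = len(psu["ssus"]), psu["p"]
        z = [n_h * n_hi * p_hi * sum(w * y for (w, y) in ssu)
             for ssu in psu["ssus"]]
        y_hi_hat = sum(z) / n_hi
        total += y_hi_hat / p_hi
    return total / n_h


# Hypothetical stratum: 2 sampled PSUs, each with 2 SSUs of (weight, value) pairs.
psus = [
    {"p": 0.2, "ssus": [[(10.0, 1.0), (10.0, 0.0)], [(12.0, 1.0)]]},
    {"p": 0.5, "ssus": [[(4.0, 1.0), (4.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]},
]
print(stratum_total_direct(psus), stratum_total_via_z(psus))
```

The two functions agree because the factors $n_h$, $n_{hi}$ and $p_{hi}$ introduced in $z_{hij}$ are divided out again when the PSU-level estimates are averaged.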


2.2 Within- and Between-PSU Variances 


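Throughout this discussion, expectations and variances 
will be defined with respect to the sample design. Under the 
conditions stated above, the variance of $\hat{Y}$ is $V(\hat{Y}) = \sum_{h=1}^{L} V_h$, 
where $V_h = V_{Bh} + V_{Wh}$, 

$$V_{Bh} = n_h^{-1} \sum_{i=1}^{N_h} p_{hi}\,(p_{hi}^{-1} Y_{hi} - Y_h)^2, \qquad V_{Wh} = n_h^{-1} \sum_{i=1}^{N_h} p_{hi}^{-1}\, \sigma^2_{2hi},$$

and $\sigma^2_{2hi} = V(\hat{Y}_{hi} \mid h, i)$; see, e.g., 
Wolter (1985, p. 42). Note especially that $Y_{hi}$ is the true 
population total for selected PSU (h, i), and that $\sigma^2_{2hi}$ reflects 
the variability in $\hat{Y}_{hi} - Y_{hi}$ attributable to subsampling at the 
SSU and finer levels. 

A customary unbiased estimator of the overall stratum- 
level variance $V_h$ is 

$$\hat{V}_h = n_h^{-1} (n_h - 1)^{-1} \sum_{i=1}^{n_h} (p_{hi}^{-1} \hat{Y}_{hi} - \hat{Y}_h)^2,$$

and the corresponding estimator of $V(\hat{Y}) = \sum_{h=1}^{L} V_h$ is 
$\hat{V}(\hat{Y}) = \sum_{h=1}^{L} \hat{V}_h$. 

Now consider estimation of the within-PSU variance $V_{Wh}$. 
Since $\hat{Y}_{hi}$ is a sample mean of the independent and identically 
distributed terms $z_{hij}$, standard arguments show that, for a 
given PSU (h, i), an unbiased estimator of $\sigma^2_{2hi}$ is 
$\hat{\sigma}^2_{2hi} = n_{hi}^{-1} (n_{hi} - 1)^{-1} \sum_{j=1}^{n_{hi}} (z_{hij} - \hat{Y}_{hi})^2$. Thus, an unbiased 
estimator of $V_{Wh}$ is 

$$\hat{V}_{Wh} = n_h^{-2} \sum_{i=1}^{n_h} p_{hi}^{-2}\, \hat{\sigma}^2_{2hi} = \sum_{i=1}^{n_h} n_{hi}^{-1} (n_{hi} - 1)^{-1} \sum_{j=1}^{n_{hi}} (x_{hij} - \bar{x}_{hi})^2,$$

where $x_{hij} = n_{hi} \sum_{k=1}^{n_{hij}} w_{hijk}\, y_{hijk}$ and $\bar{x}_{hi} = n_{hi}^{-1} \sum_{j=1}^{n_{hi}} x_{hij}$. Note that 
the latter expression for $\hat{V}_{Wh}$ uses only sample sizes, the 
observations $y_{hijk}$, and the customary weights $w_{hijk}$. 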
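
A minimal sketch of the two stratum-level estimators follows, written in terms of the SSU-level quantities $x_{hij}$ (weighted SSU sums multiplied by $n_{hi}$). It uses the identities $p_{hi}^{-1}\hat{Y}_{hi} = n_h \bar{x}_{hi}$ and $\hat{Y}_h = \sum_i \bar{x}_{hi}$, which follow from the definitions above; the function name and the numbers are hypothetical.

```python
def stratum_variance_estimates(x):
    """x[i][j] holds x_hij for sampled PSU i and SSU j in one stratum.

    Returns (overall stratum-level variance estimate,
             within-PSU variance estimate).
    """
    n_h = len(x)
    xbar = [sum(xi) / len(xi) for xi in x]
    y_h_hat = sum(xbar)                       # stratum total estimate
    # Overall: spread of the PSU-level estimates n_h * xbar_hi around y_h_hat.
    v_h = sum((n_h * xb - y_h_hat) ** 2 for xb in xbar) / (n_h * (n_h - 1))
    # Within-PSU: sum over PSUs of the variance of the SSU mean.
    v_wh = sum(
        sum((xij - xb) ** 2 for xij in xi) / (len(xi) * (len(xi) - 1))
        for xi, xb in zip(x, xbar)
    )
    return v_h, v_wh


# Hypothetical stratum with 2 sampled PSUs and 2 SSUs per PSU.
x = [[20.0, 24.0], [16.0, 10.0]]
v_h, v_wh = stratum_variance_estimates(x)
print(v_h, v_wh)
```

Both estimators need at least two PSUs per stratum and two SSUs per PSU; otherwise a divisor is zero.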


2.3 The Variance of $\hat{V}_{Wh}$ 


A direct modification of standard conditional-moment 
arguments shows that the variance of $\hat{V}_{Wh}$ is $\gamma_{Bh} + \gamma_{Wh}$, where 

$$\gamma_{Bh} = V\Big(n_h^{-2} \sum_{i=1}^{n_h} p_{hi}^{-2}\, \sigma^2_{2hi}\Big)$$

and 

$$\gamma_{Wh} = n_h^{-3} \sum_{i=1}^{N_h} p_{hi}^{-3}\, V(\hat{\sigma}^2_{2hi} \mid h, i).$$

Thus, the variance of $\hat{V}_{Wh}$ itself depends on a sum of 
between- and within-PSU variances, and the relative 
magnitudes of $\gamma_{Bh}$ and $\gamma_{Wh}$ depend on trade-offs among $\sigma^2_{2hi}$, 
$p_{hi}$ and $n_{hi}$. For example, under regularity conditions, the terms 
$V(\hat{\sigma}^2_{2hi} \mid h, i)$ are approximately inversely proportional to $n_{hi}$. 
Thus, if the $n_{hi}$ are uniformly large within stratum h, then $\gamma_{Wh}$ 
may be relatively small. Also, if the terms $p_{hi}^{-2} \sigma^2_{2hi}$ are 
approximately constant within a given stratum, then $\gamma_{Bh}$ may 
be relatively small. Conversely, marked heterogeneity of 
$p_{hi}^{-2} \sigma^2_{2hi}$ may inflate $\gamma_{Bh}$ and thus inflate $V(\hat{V}_{Wh})$ as well. 

In addition, note that under the stated design conditions, 
$\hat{V}_{Wh}$ is the sample mean of the independent and identically 
distributed terms $n_h^{-1} p_{hi}^{-2} \hat{\sigma}^2_{2hi}$. Thus, an unbiased estimator of 
the variance of $\hat{V}_{Wh}$ is 

$$\hat{V}(\hat{V}_{Wh}) = n_h^{-1} (n_h - 1)^{-1} \sum_{i=1}^{n_h} (n_h^{-1} p_{hi}^{-2}\, \hat{\sigma}^2_{2hi} - \hat{V}_{Wh})^2. \qquad (2.2)$$
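
The variance estimator just given can be sketched in the same $x_{hij}$ notation used for the within-PSU variance estimator. Under the definitions above, each PSU-level term $n_h^{-1} p_{hi}^{-2} \hat{\sigma}^2_{2hi}$ equals $n_h$ times the estimated variance of the SSU mean in that PSU, which is what the code computes. The function name and the data are hypothetical.

```python
def within_psu_variance_and_stability(x):
    """x[i][j] holds x_hij for sampled PSU i and SSU j in one stratum.

    Returns (within-PSU variance estimate, estimated variance of that
    estimate), the latter being the between-PSU spread of the PSU terms.
    """
    n_h = len(x)
    terms = []
    for xi in x:
        n_hi = len(xi)
        xbar = sum(xi) / n_hi
        s_i = sum((xij - xbar) ** 2 for xij in xi) / (n_hi * (n_hi - 1))
        terms.append(n_h * s_i)          # PSU-level term for PSU i
    v_wh = sum(terms) / n_h              # sample mean of the PSU-level terms
    var_v_wh = sum((t - v_wh) ** 2 for t in terms) / (n_h * (n_h - 1))
    return v_wh, var_v_wh


# Hypothetical stratum with 2 sampled PSUs and 2 SSUs per PSU.
x = [[20.0, 24.0], [16.0, 10.0]]
v_wh, var_v_wh = within_psu_variance_and_stability(x)
print(v_wh, var_v_wh)
```

With only two PSUs per stratum the second value rests on a single degree of freedom, which is the instability the later sections try to quantify.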


Some applications focus on the full-population level, 
rather than on individual strata, and so the "within-PSU" 
contribution of interest is the sum of the within-PSU 
variances, $V_W = \sum_{h=1}^{L} V_{Wh}$. Under the conditions given above, 
an unbiased estimator of $V_W$ is $\hat{V}_W = \sum_{h=1}^{L} \hat{V}_{Wh}$. Also, since our 
sampling and subsampling are independent across strata, we 
have $V(\hat{V}_W) = \sum_{h=1}^{L} (\gamma_{Bh} + \gamma_{Wh})$, and an unbiased estimator of 
$V(\hat{V}_W)$ is 

$$\hat{V}(\hat{V}_W) = \sum_{h=1}^{L} \hat{V}(\hat{V}_{Wh}).$$

Finally, note that the preceding development used the 
assumption of sampling with replacement at both the primary- 
and secondary-unit levels. Two applications of result (2.4.16) 
in Wolter (1985, p. 46) show that under mild conditions that 
hold for many, but not all, without-replacement designs, 
$\hat{V}_{Wh}$ will be unbiased or conservative for the true within-PSU 
variance; and $\hat{V}(\hat{V}_{Wh})$ will be unbiased or conservative for the 
true variance of $\hat{V}_{Wh}$. A formal technical statement and proof 
of this result is available from the authors. 


2.4 Balanced Interpretation of Stability Measures 


The remainder of this paper uses $\hat{V}(\hat{V}_{Wh})$ and related 
quantities to assess the stability of variance-component 
estimators. In working with these results, it is useful to 
remember that data-based measures of variance estimator 
stability are justifiably viewed with some caution, because 
they are functions of fourth sample moments, and thus are 
themselves subject to a considerable amount of sampling 
variability. See, e.g., Fuller (1984, p. 111). This caution 
carries over to the proposed estimator $\hat{V}(\hat{V}_{Wh})$ and to the 
related statistics discussed in Sections 3 and 4 below. 

However, one should not overstate this caution to the point 
of making no attempt at data-based assessment of variance 
estimator stability. The estimator $\hat{V}(\hat{V}_{Wh})$, and the related 
measures in Sections 3 and 4, are relatively simple to 
compute, and provide diagnostics that can help to identify 
variables for which: 

(a) the instability of $\hat{V}_{Wh}$ is especially problematic; or 
(b) the instability of $\hat{V}_{Wh}$ has a substantial effect on the 
precision of estimators of the relative magnitudes of 
between-PSU and within-PSU variances. 

Consequently, interpretation of specific values of $\hat{V}(\hat{V}_{Wh})$ 
and related stability measures should reflect a balance 
between the abovementioned general caution and a recogni- 
tion of their potential diagnostic value. 


3. TWO STABILITY MEASURES FOR 
WITHIN-PSU VARIANCE 
ESTIMATORS 


3.1 Degrees-of-Freedom Diagnostics for Variance 
Estimator Stability 


Some analysts prefer to express variance estimator stability 
through "degrees of freedom" measures related to the 
Satterthwaite (1941, 1946) approximation. To introduce this 
idea, consider a general variance estimator $\hat{V}$, and note that 
$\{E(\hat{V})\}^{-1} d\,\hat{V}$ has the same first and second moments as a 
chi-square random variable on d degrees of freedom, where 
d is the solution to the equation, 

$$2\{E(\hat{V})\}^2 - V(\hat{V})\, d = 0.$$

If the distribution of $\{E(\hat{V})\}^{-1} d\,\hat{V}$ is indeed well 
approximated by a chi-square distribution, then d may be 
viewed fairly literally as a "degrees of freedom" term. 
Otherwise, d can be viewed as twice the inverse of the 
squared coefficient of variation of $\hat{V}$. In either case, d has a 
certain appeal because it is scale-free, and can be tied fairly 
directly to notions of "effective sample size" in the evaluation 
of variance estimator performance. Subsection 3.3 gives 
related comments for two special cases. 

Given an unbiased estimator $\hat{V}(\hat{V})$ of the variance of $\hat{V}$, 
one may compute a "degrees of freedom" estimator $\hat{d}$ as the 
solution to the unbiased estimating equation 

$$2\{\hat{V}^2 - \hat{V}(\hat{V})\} - \hat{V}(\hat{V})\, \hat{d} = 0, \qquad (3.1)$$

i.e., $\hat{d} = \{\hat{V}(\hat{V})\}^{-1}\, 2\hat{V}^2 - 2$. Under mild regularity condi- 
tions, $d^{-1}\hat{d}$ converges in probability to one, provided 
$\{V(\hat{V})\}^{-1} \hat{V}(\hat{V})$ and $\{E(\hat{V})\}^{-1} \hat{V}$ both converge in probability 
to one. 
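
Solving the estimating equation (3.1) for the degrees-of-freedom estimator is a one-line computation, sketched below. The function name `df_estimate` and the input values are hypothetical; the formula is the closed form obtained by rearranging (3.1).

```python
def df_estimate(v_hat, var_v_hat):
    """Solve 2 * (v_hat**2 - var_v_hat) - var_v_hat * d = 0 for d."""
    return 2.0 * v_hat ** 2 / var_v_hat - 2.0


# Hypothetical variance estimate and estimated variance of that estimate.
d_hat = df_estimate(13.0, 25.0)
print(d_hat)
```

Because the estimate subtracts 2, it can be small or even negative when the estimated variance of the variance estimator is large relative to its square, which is itself a signal of instability.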


3.2 Degrees-of-Freedom Diagnostics for Pooled and 
Stratum-Level Estimators of Within-PSU 
Variances 


We can apply these general degrees-of-freedom ideas to 
the within-PSU variance estimators $\hat{V}_{Wh}$ and $\hat{V}_W$ developed 
in Section 2. First consider the case in which there is intrinsic 
interest in the stability of individual stratum-level estimators 
$\hat{V}_{Wh}$. The associated "degrees of freedom" measure is $d_{Wh} = 
\{V(\hat{V}_{Wh})\}^{-1}\, 2V_{Wh}^2$. For designs with large $n_h$, one may use (3.1) 
to compute estimators $\hat{d}_{Wh} = \{\hat{V}(\hat{V}_{Wh})\}^{-1}\, 2\hat{V}_{Wh}^2 - 2$ separately 
for each stratum. For designs with small $n_h$ (e.g., $n_h = 2$ for 
each stratum), the estimator $\hat{d}_{Wh}$ itself may be very unstable. 
Consequently, it also is useful to consider the alternative 
combined estimator 


under the assumption that all $d_{Wh}$ equal a common value $d_{W0}$.

Now consider the pooled within-PSU variance estimator $\hat V_W$
developed in Section 2.3. The resulting "degrees of freedom"
measure is $d_W = \{V(\hat V_W)\}^{-1}\, 2V_W^{2}$, and expression (3.1)
suggests the estimator

$\hat d_W = \Big\{ \sum_{h=1}^{L} \hat V(\hat V_{Wh}) \Big\}^{-1} 2\hat V_W^{2} - 2.$
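The stratum-level and pooled degrees-of-freedom estimators can be sketched as follows. This is a minimal illustration that assumes the pooled estimator $\hat V_W$ is the sum of the stratum-level estimators $\hat V_{Wh}$ and that its variance is estimated by the sum of the stratum-level variance estimates; the function names and inputs are hypothetical.

```python
from typing import List, Sequence


def stratum_df_estimates(v_wh: Sequence[float], var_v_wh: Sequence[float]) -> List[float]:
    """Stratum-level estimates: 2 * Vhat_Wh^2 / Vhat(Vhat_Wh) - 2."""
    return [2.0 * v ** 2 / s - 2.0 for v, s in zip(v_wh, var_v_wh)]


def pooled_df_estimate(v_wh: Sequence[float], var_v_wh: Sequence[float]) -> float:
    """Pooled estimate: 2 * Vhat_W^2 / sum_h Vhat(Vhat_Wh) - 2."""
    v_w = sum(v_wh)  # assumption: Vhat_W is the sum of the Vhat_Wh
    return 2.0 * v_w ** 2 / sum(var_v_wh) - 2.0


# Made-up inputs for three strata (not survey data).
v_wh = [1.0, 2.0, 3.0]
var_v_wh = [0.5, 1.0, 2.5]
per_stratum = stratum_df_estimates(v_wh, var_v_wh)
pooled = pooled_df_estimate(v_wh, var_v_wh)
print(per_stratum, pooled)
```

In this toy example the pooled value is larger than any of the stratum-level values, which is the pattern one would expect when pooling across strata.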


3.3 Comparison of $d_{W0}$ and $d_W$ to Direct SSU Counts


To interpret $\hat d_{W0}$ and $\hat d_W$ as stability measures, consider
the following idealized setting. Assume that for all $h$, the PSU
counts $n_h$ are equal to a common value $n_A$, say; and that for
all $h$ and $i$, the SSU counts $n_{hi}$ are equal to a common value
$n_B$. In addition, assume that the terms $p_{hi}^{-2}\sigma_{hi}^{2}$ are constant
within each stratum; and that, conditional on $(h, i)$, each
$(n_B - 1)\hat\sigma_{hi}^{2}/\sigma_{hi}^{2}$ is distributed as a chi-square random variable
on $n_B - 1$ degrees of freedom. Then routine arguments
show that $d_{Wh} = n_A(n_B - 1)$. If the preceding assumptions are
satisfied approximately, and if the product $n_A(n_B - 1)$ is large
(greater than 40, say), then a data user may be inclined
to view $\hat V_{Wh}$ as relatively stable, or equivalently, to view the
errors $\hat V_{Wh} - V_{Wh}$ as negligible. This appears to be the
reasoning used implicitly when estimates $\hat V_{Wh}$ are treated as
known values in design or analysis work. However, the
application in Section 5 will give some examples for which
this reasoning is problematic, so that evaluation of the
estimates $\hat d_{W0}$ is important.

Also, under the idealized conditions described above, and
under the additional assumption that the $V_{Wh}$ are all equal, we
have $d_W = L\, n_A(n_B - 1)$.


4. COMPARISON OF WITHIN-PSU 
AND OVERALL 
STRATUM-LEVEL VARIANCES 


4.1 Estimators of Between-PSU Variances and 
Related Variance Ratios 


Section 1 cited some applications that hinge on the magnitude
of $V_{Wh}$ relative to $V_h$. The specifics of the relative-magnitude
comparisons vary with the individual application,
but interest generally focuses on differences or ratios.
For example, recall that $V_{Bh} = V_h - V_{Wh}$, and define the overall
between-PSU variance term $V_B = \sum_{h=1}^{L} V_{Bh}$. In addition, note
that unbiased estimators of $V_{Bh}$ and $V_B$ are $\hat V_{Bh} = \hat V_h - \hat V_{Wh}$ and
$\hat V_B = \sum_{h=1}^{L} \hat V_{Bh}$, respectively.

Also, define the ratio $R_{WY} = V_W^{-1} V(\hat Y)$, the magnitude
of the overall variance $V(\hat Y)$ relative to the within-PSU
contribution $V_W$. A direct estimator of $R_{WY}$ is $\hat R_{WY} = \hat V_W^{-1} \hat V(\hat Y)$.
Note that if $V_{Wh}^{-1} V_h = R_{WY}$ for all $h$, then $\hat R_{WY}$ could also be
viewed as a pooled estimator of this common stratum-level
ratio.
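For a concrete sense of these two quantities, the sketch below computes the between-PSU difference and the variance ratio from a pooled within-PSU estimate and an overall variance estimate. The inputs are the HAE2 values reported in Table 2 (0.0000385 and 0.0000511); reading the smaller value as the within-PSU estimate is an inference from the ratios reported later in Table 5, and the function name is hypothetical.

```python
from typing import Tuple


def between_and_ratio(v_overall: float, v_within: float) -> Tuple[float, float]:
    """Return (V_B-hat, R_WY-hat) = (overall - within, overall / within)."""
    if v_within <= 0:
        raise ValueError("the within-PSU variance estimate must be positive")
    return v_overall - v_within, v_overall / v_within


# HAE2 row of Table 2: within-PSU estimate and overall variance estimate.
v_b_hat, r_wy_hat = between_and_ratio(0.0000511, 0.0000385)
print(v_b_hat, round(r_wy_hat, 3))
```

The results agree, up to rounding, with the HAE2 entries of Table 5.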

For both $\hat V_B$ and $\hat R_{WY}$, stability assessment involves the
variance of $\hat V_h$ and the covariance of $\hat V_{Wh}$ with $\hat V_h$. Estimation
of these moments can be somewhat problematic for
surveys that select small numbers of PSUs from each stratum.
We consider two approaches to resolving this problem.
Section 4.2 uses moderate restrictions on the moment
structure of $(\hat V_{Wh}, \hat V_h)$ to develop estimators $\hat V(\hat V_B)$ and
related quantities. Section 4.3 uses stratum collapse to
develop alternative stability measures.


4.2 Stability Measures Based on $\hat V(\hat V_{Wh})$ and Moment
Conditions


4.2.1 Moment Conditions 


Under moderate moment restrictions, we can estimate the
variance of $\hat V_h$ directly from $\hat V_h$ itself. Specifically, assume
that the variance of $\hat V_h$ equals $(n_h - 1)^{-1}\, 2V_h^{2}$; this would hold,
e.g., under the standard assumption that $V_h^{-1}(n_h - 1)\hat V_h$ is
distributed as a chi-square random variable on $n_h - 1$ degrees
of freedom. As in Sections 2 and 3, we continue to assume that
$\hat V_h$ is unbiased for $V_h$. Then $E(\hat V_h^{2}) = (n_h - 1)^{-1}(n_h + 1)V_h^{2}$, so that
$(n_h + 1)^{-1}\, 2\hat V_h^{2}$ is an unbiased estimator of the variance
of $\hat V_h$.

In the remainder of Section 4.2, we will also assume that
$\mathrm{Cov}(\hat V_{Wh}, \hat V_h) = 0$. Routine conditional-moment arguments
show that this will hold if the terms $p_{hi}^{-2}\sigma_{hi}^{2}$ are equal within
a given stratum; and if, conditional on $(h, i, j)$, the SSU-level
estimates $x_{hij}$ are normally distributed, so that the PSU-level
point estimates are conditionally independent of the
corresponding within-PSU variance estimates.


4.2.2 Stability Measures 


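Under the conditions stated in Section 4.2.1, unbiased
estimators of $V(\hat V_{Bh})$ and $V(\hat V_B)$ are

$\hat V(\hat V_{Bh}) = (n_h + 1)^{-1}\, 2\hat V_h^{2} + \hat V(\hat V_{Wh})$    (4.1)

and $\hat V(\hat V_B) = \sum_{h=1}^{L} \hat V(\hat V_{Bh})$, where $\hat V(\hat V_{Wh})$ is defined in
expression (2.2). Also, under the same conditions routine
ratio-estimation arguments lead to the variance estimator

$\hat V(\hat R_{WY}) = \hat V_W^{-2} \sum_{h=1}^{L} \big\{ (n_h + 1)^{-1}\, 2\hat V_h^{2} + \hat R_{WY}^{2}\, \hat V(\hat V_{Wh}) \big\}.$    (4.2)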
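The two estimators of this subsection can be sketched in a few lines. The sketch reads (4.1) as $(n_h+1)^{-1} 2\hat V_h^2 + \hat V(\hat V_{Wh})$ summed over strata, and (4.2) as $\hat V_W^{-2}\sum_h\{(n_h+1)^{-1}2\hat V_h^2 + \hat R_{WY}^2 \hat V(\hat V_{Wh})\}$; it also assumes that the overall variance estimate is the sum of the $\hat V_h$ and that $\hat V_W$ is the sum of the $\hat V_{Wh}$. The function name and the inputs are hypothetical.

```python
from typing import Sequence, Tuple


def moment_based_variances(
    v_h: Sequence[float],       # stratum-level overall variance estimates
    v_wh: Sequence[float],      # stratum-level within-PSU variance estimates
    var_v_wh: Sequence[float],  # estimated variances of the within-PSU estimates
    n_h: Sequence[int],         # number of sample PSUs per stratum
) -> Tuple[float, float]:
    """Return (Vhat(V_B-hat), Vhat(R_WY-hat)) under the Section 4.2.1 conditions."""
    v_w = sum(v_wh)
    r_hat = sum(v_h) / v_w
    var_vb = 0.0
    var_r = 0.0
    for vh, vvw, n in zip(v_h, var_v_wh, n_h):
        var_vh = 2.0 * vh ** 2 / (n + 1)     # moment-based variance estimate of V_h-hat
        var_vb += var_vh + vvw               # summand of (4.1)
        var_r += var_vh + r_hat ** 2 * vvw   # summand of (4.2)
    return var_vb, var_r / v_w ** 2


# Made-up inputs for two strata with two PSUs each.
var_vb, var_r = moment_based_variances([3.0, 6.0], [1.0, 2.0], [0.5, 1.0], [2, 2])
print(var_vb, var_r)
```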


4.3 Alternative Stability Measures Based on Stratum 
Collapse 


The assumptions of Section 4.2.1 may be problematic in
some applications. For example, for some survey designs and
variables, the SSU-level estimators $x_{hij}$ may have markedly
nonnormal distributions, so the assumption $\mathrm{Cov}(\hat V_{Wh}, \hat V_h) = 0$
may not hold. For these cases, one may consider the use of
stratum collapse to produce alternative estimators of $V(\hat V_B)$
and $V(\hat R_{WY})$.
Specifically, partition the set of $L$ strata into $G$ prespecified
groups, with $L_g$ strata contained in group $S_g$, $g = 1, \ldots, G$.
With this new notation, note that


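Standard stratum-collapse methods (e.g., Wolter 1985,
Section 2.5) then lead to the alternative variance estimator,

$\hat V^{*}(\hat V_B) = \sum_{g=1}^{G} (L_g - 1)^{-1} L_g \sum_{h \in S_g} D_{gh}^{2},$

where $D_{gh} = \hat V_{Bh} - L_g^{-1} \sum_{j \in S_g} \hat V_{Bj}$. Similarly, a collapsed-stratum
variance estimator for $\hat R_{WY}$ is

$\hat V^{*}(\hat R_{WY}) = \hat V_W^{-2} \sum_{g=1}^{G} (L_g - 1)^{-1} L_g \sum_{h \in S_g} C_{gh}^{2},$

where $C_{gh} = (\hat V_h - \hat R_{WY}\hat V_{Wh}) - L_g^{-1} \sum_{j \in S_g} (\hat V_j - \hat R_{WY}\hat V_{Wj})$.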

In general, collapsed-stratum variance estimators require
some care in interpretation; see, e.g., Rust and Kalton (1985),
Wolter (1985, Section 2.5) and references cited therein. For
example, collapsed-stratum variance estimators generally will
be conservative. In addition, for cases with moderate $L$, the
variance estimators $\hat V^{*}(\hat V_B)$ and $\hat V^{*}(\hat R_{WY})$ may themselves
have limited stability.
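A collapsed-stratum computation of this kind can be sketched as follows. The sketch uses the standard form in which each group contributes $L_g/(L_g-1)$ times the sum of squared deviations of its stratum-level terms from the group mean; for the ratio it applies the same computation to the linearized terms $\hat V_h - \hat R_{WY}\hat V_{Wh}$ and divides by $\hat V_W^2$. The function names, the grouping and the inputs are hypothetical.

```python
from typing import Sequence


def collapsed_stratum_variance(values: Sequence[float], groups: Sequence[Sequence[int]]) -> float:
    """Sum over groups of L_g / (L_g - 1) * sum_h (value_h - group mean)^2."""
    total = 0.0
    for members in groups:
        l_g = len(members)
        if l_g < 2:
            raise ValueError("each collapsed group needs at least two strata")
        mean_g = sum(values[h] for h in members) / l_g
        total += l_g / (l_g - 1) * sum((values[h] - mean_g) ** 2 for h in members)
    return total


def collapsed_ratio_variance(
    v_h: Sequence[float], v_wh: Sequence[float], groups: Sequence[Sequence[int]]
) -> float:
    """Collapsed-stratum variance estimate for the ratio sum(v_h) / sum(v_wh)."""
    v_w = sum(v_wh)
    r_hat = sum(v_h) / v_w
    residuals = [a - r_hat * b for a, b in zip(v_h, v_wh)]  # linearized terms
    return collapsed_stratum_variance(residuals, groups) / v_w ** 2


# Made-up example: four strata collapsed into two groups of two.
groups = [[0, 1], [2, 3]]
var_vb_star = collapsed_stratum_variance([1.0, 3.0, 2.0, 6.0], groups)
var_r_star = collapsed_ratio_variance([2.0, 4.0, 3.0, 7.0], [1.0, 1.0, 1.0, 1.0], groups)
print(var_vb_star, var_r_star)
```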


5. APPLICATION TO THE U.S. THIRD 
NATIONAL HEALTH AND 
NUTRITION EXAMINATION 
SURVEY 


5.1 Sample Design and Estimation Methods 


The methods proposed in Sections 2 through 4 were 
applied to data from Phase I of the Third National Health and 
Nutrition Examination Survey (NHANES III). National 
Center for Health Statistics (1996) gives a general description 
of this survey, including special characteristics associated 
with Phase I (data collected between 1988 and 1991). For the 
present discussion, three aspects are of special interest. First, 
variance estimators were constructed on the basis of a 
collapsed design involving L = 22 strata (large groups of 
counties), with two primary sample units (generally individual 
counties) selected per stratum. Second, each selected PSU 
had a relatively large number of selected SSUs (generally 
groups of city blocks, or similar rural areas). The number of 
selected SSUs within each stratum ranged from 30 to 63, with 
a mean of 45.8. 

Third, additional subsampling within each SSU led to 
selection of the survey elements (individual noninstitu- 
tionalized U.S. civilians). Each selected person was asked to 
respond to a health questionnaire and to participate in a 
detailed medical examination. Twelve of the resulting 
variables are listed in Table 1. 
Standard weighted ratio estimates $\hat\theta$ were computed for
the population means of each of the twelve variables listed in
Table 1. The first two columns of Table 2 present the
corresponding variance estimates $\hat V_W$ and $\hat V(\hat\theta)$. As part of
a larger study of the within-PSU variances $V_{Wh}$ discussed in
Jang and Eltinge (1996), there was considerable interest in the
stability of the individual estimates $\hat V_{Wh}$. Since we had $n_h = 2$
for each stratum, the reasoning in Section 3.2 indicated that
it was not feasible to examine the individual terms $\hat d_{Wh}$.
Consequently, Section 5.2 will examine the pooled measure
$\hat d_{W0}$ of the stability of the $\hat V_{Wh}$, and will also present some
related simulation-based tests and diagnostic plots.


Table 1
Twelve NHANES III Variables

Variable name   Description
HAE2            Told by health professional that you had hypertension (indicator variable)
HAE7            Told by health professional that your blood cholesterol was high (indicator variable)
HAD1            Told by health professional that you had diabetes (indicator variable)
HAR3            Do you smoke cigarettes now?
BMPHT           Height
BMPWT           Weight
HDRESULT        HDL cholesterol
TCRESULT        Serum total cholesterol
LEAD            Blood lead, in micrograms per deciliter
log(LEAD)       Natural logarithm of blood lead
BP1K1           Systolic blood pressure
BP1K5           Diastolic blood pressure


Table 2
Variance Estimates and Stability Measures for
Twelve NHANES III Variables

Variable name   $\hat V_W$     $\hat V(\hat\theta)$   $\hat d_{W0}$     $\hat d_W$
HAE2            0.0000385      0.0000511      23.7          425.8
HAE7            0.0000821      0.000135       13.6          225.6
HAD1            0.00000956     0.00000749     8.8           160.6
HAR3            0.000122       0.000205       6.4           125.8
BMPHT           0.0223         0.0416         15.3          [illegible]
BMPWT           0.104          0.122          8.6           139.2
HDRESULT        0.0743         0.163          11.5          196.2
TCRESULT        0.590          0.860          21.2          353.9
LEAD            0.00388        0.00657        2.8           48.8
log(LEAD)       0.000211       0.000678       10.5          174.9
BP1K1           1.073          2.896          1.0           26.5
BP1K5           0.252          0.217          [illegible]   52.9


In addition, there was interest in the extent to which the
variances of the $\hat V_{Wh}$ contributed to the variances of the pooled
quantities $\hat V_B$ and $\hat R_{WY}$. Section 5.3 explores this question.
5.2 Within-PSU Variance Estimates and Associated
Stability Measures


5.2.1 Comparison Across Variables 


The final two columns of Table 2 report the degrees-of-freedom
estimates $\hat d_{W0}$ and $\hat d_W$ for the twelve NHANES III
variables. Note especially that the stratum-level stability
measures $\hat d_{W0}$ are relatively low, compared to the mean of
45.8 SSUs per stratum. For example, all of the variables have $\hat d_{W0}$
less than 24, and five (HAD1, HAR3, BMPWT, LEAD and
BP1K1) have $\hat d_{W0}$ less than 10. Due to the interest in the
$\hat V_{Wh}$ described above, this led to two general questions.

(1) Are the observed $\hat d_{W0}$ consistent with the nominal
degrees-of-freedom value $d_{W0}$ that one would anticipate
from the direct SSU counts $n_{h1} + n_{h2} - 2$?

(2) Conversely, are the observed $\hat d_{W0}$ consistent with
distributional conditions that produce considerably
smaller values of $d_{W0}$?

Standard large-sample-theory-based tests for (1) and (2) 
would have depended on eighth sample moments, and thus 
were inadvisable in the present case, due to the relatively 
small values of L = 22 and n, = 2. Instead, the following 
simulation-based test was carried out. 


5.2.2 Simulation-Based Interpretation of Stability 
Measures 


This simulation work covers six cases involving different
values of two terms. The first term, denoted $d_{hi}$, represents
the degrees of freedom associated with the variance estimator
$\hat\sigma_{hi}^{2}$ in PSU $(h, i)$. The second term, denoted $R_h$, is the ratio
of the expressions $p_{hi}^{-2}\sigma_{hi}^{2}$ in the first and second sample
PSUs in stratum $h$.

In each of the six cases discussed below, independent
pseudorandom variables $g_{hi}$ were generated from a chi-square
distribution on $d_{hi}$ degrees of freedom for $h = 1, 2, \ldots, 22$
and $i = 1, 2$. Re-scaled variables $\hat V_{Whi} = d_{hi}^{-1} V_{Whi}\, g_{hi}$ were then
computed, where $V_{Whi}$ is a random variable equal to one
with probability one-half and equal to $R_h$ with probability
one-half. The random variables $g_{hi}$ and $V_{Whi}$ are mutually
independent. Finally, the sums $\hat V_{Wh} = \hat V_{Wh1} + \hat V_{Wh2}$ and the
associated measures $\hat V(\hat V_{Wh})$, $\hat V(\hat V_W)$ and $\hat d_{W0}$ were computed.
This was repeated 10,000 times.

Table 3 lists the values of $d_{hi}$ and $R_h$ covered in the six
cases, and Table 4 lists the resulting simulated means,
standard deviations and quantiles for $\hat d_{W0}$. When interpreting
the results for these cases, note that randomness of the $g_{hi}$
corresponds to the estimation error in the $\hat\sigma_{hi}^{2}$ due to
subsampling at the SSU and lower levels; and randomness of
the $V_{Whi}$ reflects the variability of the $p_{hi}^{-2}\sigma_{hi}^{2}$ induced by
sampling of PSUs within a given stratum.
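The simulation design described above can be sketched with the Python standard library. Several details are assumptions of this sketch rather than statements of the original method: the re-scaling divides each chi-square draw by its degrees of freedom; with two PSUs per stratum the variance estimate for each stratum sum is taken to be the squared difference of the two PSU-level terms; and the combined degrees-of-freedom estimate is taken to be twice the sum of squared stratum estimates divided by the sum of their variance estimates, minus two. All function and variable names are hypothetical.

```python
import random
from statistics import median
from typing import List, Sequence

L_STRATA = 22  # number of strata, as in the NHANES III collapsed design


def simulate_d_w0(d_values: Sequence[int], r_h: float, rng: random.Random) -> float:
    """One replicate of the (assumed) combined degrees-of-freedom estimate."""
    sum_sq = 0.0   # sum over strata of Vhat_Wh squared
    sum_var = 0.0  # sum over strata of the variance estimate of Vhat_Wh
    for _ in range(L_STRATA):
        psu_terms = []
        for _ in range(2):
            d_hi = rng.choice(d_values)               # one value: fixed d; several: "observed distribution"
            g_hi = rng.gammavariate(d_hi / 2.0, 2.0)  # chi-square on d_hi degrees of freedom
            v_whi = 1.0 if rng.random() < 0.5 else r_h
            psu_terms.append(v_whi * g_hi / d_hi)
        sum_sq += (psu_terms[0] + psu_terms[1]) ** 2
        sum_var += (psu_terms[0] - psu_terms[1]) ** 2  # assumed two-PSU variance estimate
    return 2.0 * sum_sq / sum_var - 2.0


def run_case(d_values: Sequence[int], r_h: float, reps: int, seed: int) -> List[float]:
    """Return the sorted simulated estimates for one case."""
    rng = random.Random(seed)
    return sorted(simulate_d_w0(d_values, r_h, rng) for _ in range(reps))


if __name__ == "__main__":
    for label, d_values, r_h in [("Case 1", [22], 1.0), ("Case 3", [5], 1.0), ("Case 4", [22], 9.0)]:
        print(label, round(median(run_case(d_values, r_h, reps=1000, seed=1)), 1))
```

For a case with randomly selected degrees of freedom, `d_values` would hold the 44 observed values $n_{hi} - 1$, which are not reproduced here.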


Table 3
Cases Covered for the Simulated Quantiles

Cases   $d_{hi}$      $R_h$
1       22            1
2       Obs. Dist.    1
3       5             1
4       22            9
5       Obs. Dist.    9
6       5             9


Case 1 uses $d_{hi} = 22$ and $R_h = 1$. Arguments from
Section 3.3 show that the resulting $\hat V_{Wh}$ are distributed as
constant multiples of a chi-square random variable with
$d_{W0} = 44$ degrees of freedom. Thus, for Case 1, the choice of
$d_{hi} = 22$ has led to simulated quantiles of $\hat d_{W0}$ that are
approximately those that one would anticipate from the mean
SSU count of 45.8 observed for Phase I of NHANES III,
under the setting described in Section 3.3. Note that even in
this idealized Case 1, the relative variability of the $\hat d_{W0}$ is
fairly high.

Now compare the $\hat d_{W0}$ reported in Table 2 to the simulated
quantiles from Case 1. All twelve of the observed $\hat d_{W0}$ fall
below the 0.025 simulated quantile of 24.8; and ten of the
twelve fall below the 0.005 quantile of 21.1. Thus, the $\hat d_{W0}$
observed for the NHANES III variables are not consistent
with a nominal $d_{W0} = 44$ produced in the idealized setting
covered by Case 1.




5.2.3 Simulation Under Alternative Conditions with
Smaller $d_{W0}$


In general, the distribution of $\hat d_{W0}$ may deviate from that
observed under the idealized Case 1 due to: (a) variability in
the true SSU counts $n_{hi}$; (b) limited stability of the PSU-level
estimates $\hat\sigma_{hi}^{2}$; and (c) heterogeneity of the true PSU-level
terms $\sigma_{hi}^{2}$. Cases 2 through 6 cover the combined effects of
these three factors.


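Table 4
Simulated Quantiles for $\hat d_{W0}$

Cases   Mean          S.D.          $q_{.005}$   $q_{.01}$     $q_{.025}$    $q_{.05}$     $q_{.10}$
1       48.9          [illegible]   21.1         [illegible]   24.8          27.4          30.7
2       48.3          17.5          20.7         21.9          24.2          26.8          29.9
3       [illegible]   4.7           4.1          [illegible]   [illegible]   5.6           6.4
4       [illegible]   [illegible]   1.4          1.6           2.0           2.3           [illegible]
5       [illegible]   [illegible]   1.4          1.6           1.9           2.3           [illegible]
6       [illegible]   [illegible]   0.7          0.8           1.0           [illegible]   1.5

Cases   $q_{.25}$     $q_{.50}$     $q_{.75}$     $q_{.90}$     $q_{.95}$     $q_{.975}$    $q_{.99}$     $q_{.995}$
1       36.7          45.5          [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
2       36.3          [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
3       8.0           10.3          [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
4       [illegible]   5.0           6.8           8.9           10.5          12.1          14.8          16.7
5       [illegible]   5.0           6.7           8.9           10.6          [illegible]   14.1          16.1
6       [illegible]   3.0           4.4           6.0           7.4           8.8           [illegible]   12.6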
The design for Case 2 was identical to that for Case 1,
except that the $d_{hi}$ were random variables, selected with equal
probabilities and with replacement from the 44 values $n_{hi} - 1$
corresponding to the 44 SSU counts $n_{hi}$ in the original dataset.
The resulting simulated quantiles of $\hat d_{W0}$ are similar to
those for Case 1.

Case 3 uses $d_{hi} = 5$ and $R_h = 1$; the resulting $\hat V_{Wh}$ are
distributed as constant multiples of chi-square random
variables with $d_{W0} = 10$ degrees of freedom. The simulated
quantiles for Case 3 were somewhat more consistent with the
$\hat d_{W0}$ observed for the NHANES III dataset. For example, ten
of the twelve variables have $\hat d_{W0}$ at or above the simulated
0.10 quantile of 6.4. However, two of the variables (lead and
systolic blood pressure) had their $\hat d_{W0}$ below the simulated
0.005 quantile for Case 3.

Cases 4 through 6 cover more extreme cases of instability,
induced by use of the scale factor $R_h = 9$. A scale factor
different from one introduces a component of variability
associated with sampling of PSUs with unequal $p_{hi}^{-2}\sigma_{hi}^{2}$ and
causes the $\hat V_{Wh}$ to have distributions outside of the rescaled
chi-square family. Cases 4 through 6 use the same $d_{hi}$ values
used in Cases 1 through 3, respectively. The smallest
observed NHANES III $\hat d_{W0}$ values are somewhat more
consistent with the simulated quantiles for Cases 4 through 6,
although the $\hat d_{W0} = 1.0$ for systolic blood pressure still falls
below the simulated 0.005 quantile for Cases 4 and 5, and
is approximately equal to the simulated 0.025 quantile for
Case 6.

In addition, note that the three largest observed $\hat d_{W0}$ values
(for the hypertension indicator, the total cholesterol measure,
and diastolic blood pressure) fall above the simulated upper
0.995 quantiles for each of Cases 4 through 6. This, in
conjunction with the abovementioned results for Cases 1
through 3, indicates that the twelve observed $\hat d_{W0}$ are
consistent with settings that produce substantially different
true $d_{W0}$ values for different variables.

Taken together, these simulation results suggest that for
the twelve NHANES III variables examined, the stability of $\hat V_{Wh}$
may be substantially worse than one would anticipate from a
simple count of SSUs within each stratum; and that the true
stability measures $d_{W0}$ may vary substantially from one
variable to the next.


5.2.4 Diagnostic Plots 


In a purely numerical sense, $\hat d_{W0}$ depends on the magnitudes
of the $\hat V(\hat V_{Wh})$ relative to the terms $\hat V_{Wh}^{2}$. Consequently,
diagnostic plots of $\{\hat V(\hat V_{Wh})\}^{1/2}$ against $\hat V_{Wh}$ are useful
in the identification of specific patterns and individual strata
that lead to unusually high or low $\hat d_{W0}$.

Figures 1 through 3 give plots for the variables HAE2
(diagnosed hypertension), log(blood lead), and blood lead,
respectively. Each plot was constructed with horizontal and
vertical axes on the same scale. The plot for HAE2 has the
bulk of its points well below a line with slope = 1 and
intercept = 0. In addition, the values of $\{\hat V(\hat V_{Wh})\}^{1/2}$ that are large
in an absolute sense are still substantially less than the
corresponding $\hat V_{Wh}$. This is consistent with the relatively large
degrees-of-freedom value $\hat d_{W0} = 23.7$. The plot for log(blood
lead) shows a somewhat greater concentration of points near
the line with slope = 1 and intercept = 0, which is consistent
with the somewhat smaller value $\hat d_{W0} = 10.5$.

The plot for blood lead shows one apparent outlier: the
largest value of $\{\hat V(\hat V_{Wh})\}^{1/2}$ is approximately equal to the
corresponding $\hat V_{Wh}$. For this stratum, we examined the terms $\hat V_{Wh}$
and $p_{hi}^{-2}\hat\sigma_{hi}^{2}$ for unusual patterns, e.g., extreme individual
values or extreme element-level weights. Here, one of the
two associated $p_{hi}^{-2}\hat\sigma_{hi}^{2}$ values was approximately equal to
zero and the other was the largest of all the PSU-level
terms $p_{hi}^{-2}\hat\sigma_{hi}^{2}$. In addition, the stratum in question had the
largest $\hat V_h$ value. However, this stratum did not display
outlying values of $\{\hat V(\hat V_{Wh})\}^{1/2}$ and $\hat V_h$ for other related
variables, e.g., log(blood lead). Thus, the unusual pattern
observed for blood lead may be attributable to a few very
high observed values for the blood lead variable, rather
than to the sample design or weighting as such. Within this
context, note that at the population level in the U.S., lead
measurements tend to have a roughly lognormal
distribution, and high lead measurements show some
tendency to be clustered together due to environmental
factors.

Figure 2. Plot of $\{\hat V(\hat V_{Wh})\}^{1/2}$ against $\hat V_{Wh}$ for log(blood lead)

Figure 3. Plot of $\{\hat V(\hat V_{Wh})\}^{1/2}$ against $\hat V_{Wh}$ for blood lead


5.3 Between-PSU Variance Estimates and the
Variance Ratio $R_{WY}$


Table 5 presents the estimates $\hat V_B$ and $\hat R_{WY}$, and
associated standard errors, for the twelve NHANES III
variables. Of special interest are the columns labeled
$\{\hat V(\hat V_B)\}^{-1}\hat V(\hat V_W)$, the proportion of the variance estimate
$\hat V(\hat V_B)$ that is attributable to the within-PSU variance term;
and $\{\hat V(\hat R_{WY})\}^{-1}\hat V_W^{-2}\hat R_{WY}^{2}\hat V(\hat V_W)$, the corresponding proportion
for $\hat R_{WY}$. Relatively large values for these proportions
indicate that $\hat V(\hat V_W)$ makes a substantial contribution to
$\hat V(\hat V_B)$ and $\hat V(\hat R_{WY})$ for the variables in question.

Note that the proportion $\{\hat V(\hat R_{WY})\}^{-1}\hat V_W^{-2}\hat R_{WY}^{2}\hat V(\hat V_W)$ is
greater than or equal to 0.3 for blood lead, BP1K1 (systolic
blood pressure) and BP1K5 (diastolic blood pressure). For
blood lead and BP1K1, the large proportions arise primarily
because of the relatively large value of $\hat V(\hat V_W)$. For BP1K5,
$\hat V(\hat V_W)$ is not as large on a relative scale, but the proportion
$\{\hat V(\hat R_{WY})\}^{-1}\hat V_W^{-2}\hat R_{WY}^{2}\hat V(\hat V_W)$ is still large because $\hat V_W$ is not
small relative to $\hat V(\hat\theta)$. For all three variables, the relatively
large values of $\{\hat V(\hat R_{WY})\}^{-1}\hat V_W^{-2}\hat R_{WY}^{2}\hat V(\hat V_W)$ indicate that it is
important to account for the variance $V(\hat V_W)$ when one
considers the stability of $\hat R_{WY}$. For BP1K5, a similar comment
applies to the effect of $\hat V(\hat V_W)$ on the stability of $\hat V_B$.


Table 5
Estimates of $V_B$ and $R_{WY}$ for Twelve NHANES III Variables
with Associated Standard Errors and Relative
Within-PSU Contributions

Variable name   $\hat V_B$      se($\hat V_B$)   $\{\hat V(\hat V_B)\}^{-1}\hat V(\hat V_W)$
HAE2            0.0000126       0.0000188        0.020
HAE7            0.0000532       0.0000445        0.030
HAD1            -0.00000208     0.00000246       0.186
HAR3            0.0000825       0.0000703        0.047
BMPHT           0.0193          0.0114           0.027
BMPWT           0.0174          0.0400           0.096
HDRESULT        0.0887          0.0744           0.010
TCRESULT        0.270           0.253            0.031
LEAD            0.00269         0.00188          0.168
log(LEAD)       0.000468        0.000205         0.012
BP1K1           1.823           0.997            0.081
BP1K5           -0.0351         0.0793           0.367

Variable name   $\hat R_{WY}$   se($\hat R_{WY}$)   $\{\hat V(\hat R_{WY})\}^{-1}\hat V_W^{-2}\hat R_{WY}^{2}\hat V(\hat V_W)$
HAE2            1.327           0.491               0.034
HAE7            1.648           0.556               0.077
HAD1            0.783           0.247               0.123
HAR3            1.676           0.600               0.122
BMPHT           1.864           0.530               0.089
BMPWT           1.168           0.391               0.126
HDRESULT        2.193           1.020               0.047
TCRESULT        1.458           0.436               0.063
LEAD            1.694           0.555               0.367
log(LEAD)       [illegible]     1.025               0.112
BP1K1           2.699           1.142               0.391
BP1K5           0.861           0.300               0.300

6. DISCUSSION 


This paper has presented three main ideas. First, due to
the role that estimated within-PSU variances $\hat V_{Wh}$ play in
survey design and analysis, it is important to account for
the sampling error encountered in estimation of $V_{Wh}$.
Second, standard design-based estimation methods lead to
relatively simple estimators of the design variance of $\hat V_{Wh}$.
In general, interpretation of these stability measures
requires some caution. However, they can provide useful
diagnostics for the identification of variables for which the
instability of $\hat V_{Wh}$ is especially problematic, or has an
especially pronounced effect on the variance of related
quantities like $\hat V_B$ and $\hat R_{WY}$. Third, the application to the
U.S. Third National Health and Nutrition Examination
Survey (NHANES III), and associated simulation work,
indicated the following.

(i) For different sets of variables, the observed stability
measures $\hat d_{W0}$ are consistent with substantially
different sets of stability conditions.

(ii) For some variables, the estimators $\hat V_{Wh}$ are
considerably less stable than one would anticipate
from a direct count of secondary sample units.

(iii) For some variables, the estimated variance of $\hat V_W$
makes a substantial contribution to the estimated
variances of the estimated between-PSU variance $\hat V_B$
and the variance ratio $\hat R_{WY}$.
Asymptotic Variance for Sequential Sampling Without
Replacement With Unequal Probabilities


YVES G. BERGER¹


ABSTRACT 


We propose a second-order inclusion probability approximation for the Chao plan (1982) to obtain an approximate variance 
estimator for the Horvitz and Thompson estimator. We will then compare this variance with other approximations provided 
for the randomized systematic sampling plan (Hartley and Rao 1962), the rejective sampling plan (Hajek 1964) and the 
Rao-Sampford sampling plan (Rao 1965 and Sampford 1967). Our conclusion will be that these approximations are 
equivalent if the first-order inclusion probabilities are small and if the size of the sample is large. 


KEY WORDS: Sampling with replacement; Randomized systematic sampling plan; Rejective sampling plan; Rao- 
Sampford sampling plan; Inclusion probabilities; Horvitz-Thompson; Yates-Grundy. 


1. INTRODUCTION 


Consider a finite population $U_N$ containing $N$ units and a
subset $U_k$ of $U_N$ comprising the first $k$ units of $U_N$. Let $\pi_{(k;i)}$
denote the first-order inclusion probabilities for a population
$U_k$. We assume that they are proportional to an auxiliary
variable. These probabilities have two arguments: the size $k$
of the population and the serial number $i$ of the unit within the
population. We assume that $\pi_{(k;i)} < 1$ for all $i$ and all
$k > n$. This hypothesis has more chance of breaking down
when $k$ is small, i.e., close to $n$. We can solve this problem by
assuming that the values of the auxiliary variable show little
dispersion for those units occurring at the beginning of the
population.

Let $\pi_{(k;i,j)}$ denote the second-order inclusion probability of
units $i$ and $j$ for a population $U_k$. These probabilities are
dependent on the sampling plan used.

We will use the Horvitz-Thompson estimator (1951) to
estimate the total $\sum_{i=1}^{N} Y_i$ of a variable $Y$. This estimator is
given by

$\hat t_{HT} = \sum_{i \in S_N} \frac{Y_i}{\pi_{(N;i)}},$    (1)

where $S_N$ is a sample of $U_N$. We assume that the size of $S_N$ is
constant and equal to $n$.

Given that the size of the sample is fixed, a variance
estimator of (1) is given by the Yates-Grundy estimator
(1953),

$\hat V(\hat t_{HT}) = -\sum_{j \in S_N} \sum_{i \in S_N;\, i<j} \frac{\Delta_{(N;i,j)}}{\pi_{(N;i,j)}} \left( \frac{Y_i}{\pi_{(N;i)}} - \frac{Y_j}{\pi_{(N;j)}} \right)^{2},$    (2)

where

$\Delta_{(N;i,j)} = \pi_{(N;i,j)} - \pi_{(N;i)}\, \pi_{(N;j)}.$    (3)
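The estimator of the total and the Yates-Grundy variance estimator can be sketched as follows. The sketch takes the sampled values, their first-order inclusion probabilities and a table of second-order inclusion probabilities indexed by pairs of sample positions; these data structures and the function names are choices made for the illustration only.

```python
from typing import Dict, Sequence, Tuple


def horvitz_thompson_total(y: Sequence[float], pi: Sequence[float]) -> float:
    """Sum of y_i / pi_i over the sampled units."""
    return sum(yi / p for yi, p in zip(y, pi))


def yates_grundy_variance(
    y: Sequence[float],
    pi: Sequence[float],
    pi2: Dict[Tuple[int, int], float],  # pi2[(i, j)] for sample positions i < j
) -> float:
    """Minus the sum over pairs i < j of (Delta_ij / pi_ij) * (y_i/pi_i - y_j/pi_j)^2."""
    total = 0.0
    for j in range(len(y)):
        for i in range(j):
            delta = pi2[(i, j)] - pi[i] * pi[j]
            diff = y[i] / pi[i] - y[j] / pi[j]
            total -= delta / pi2[(i, j)] * diff ** 2
    return total


# Check against simple random sampling without replacement, n = 3 from N = 6.
n, N = 3, 6
y = [1.0, 2.0, 3.0]
pi = [n / N] * n
pi2 = {(i, j): n * (n - 1) / (N * (N - 1)) for j in range(n) for i in range(j)}
t_hat = horvitz_thompson_total(y, pi)
v_hat = yates_grundy_variance(y, pi, pi2)
print(t_hat, v_hat)
```

For simple random sampling the result can be compared with the familiar expression $N^2(1 - n/N)s^2/n$, which equals 6 for these values ($s^2 = 1$).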


Let us consider the sample size sequence $\{n_1, n_2, \ldots, n_\nu, \ldots\}$
and the population size sequence $\{N_1, N_2, \ldots, N_\nu, \ldots\}$, where
$n_\nu$ and $N_\nu$ increase whenever $\nu \to \infty$. To simplify the problem
we eliminate the index $\nu$.

The asymptotic approach used here is that of Hajek (1964):

$d = \sum_{j=1}^{N} \pi_{(N;j)} [1 - \pi_{(N;j)}] \to \infty,$

which means that $n \to \infty$ and $(N - n) \to \infty$, given that
$d < \sum_{j=1}^{N} [1 - \pi_{(N;j)}] = N - n$ and that $d < \sum_{j=1}^{N} \pi_{(N;j)} = n$.

In section 2, we introduce the Chao sampling plan (1982)
as well as three results linked to first and second-order
inclusion probabilities. In section 3, we provide an approximation
of $\pi_{(N;i,j)}$. In section 4, we propose an approximation
of the Yates-Grundy variance. Section 5 compares this
variance approximation with other approximations proposed
for the randomized systematic plan, the rejective plan and the
Rao-Sampford plan. Two numerical examples are provided in
section 6.


2. CHAO SAMPLING PLAN 


This is a sampling plan without replacement with unequal
probabilities, of fixed size. This method is a generalization of
the method used by McLeod and Bellhouse (1983) for a
simple plan.

Let $S_k$ denote a sample of size $n$ of $U_k$ with a set $\{\pi_{(k;i)} : i \in U_k\}$
of first-order inclusion probabilities. The Chao plan provides
for a sample $S_{k+1}$ of size $n$ of $U_{k+1}$ with a set $\{\pi_{(k+1;i)} : i \in U_{k+1}\}$
of first-order inclusion probabilities. The method entails
selecting the $(k + 1)$-th unit with the probability $\pi_{(k+1;k+1)}$. If
this unit is not selected, then we take $S_{k+1} = S_k$; otherwise we
take $S_{k+1} = S_k \cup \{k + 1\} \setminus \{j\}$, where $j$ is a unit selected at
random within $S_k$. The procedure starts from an initial sample
$S_n = U_n$, comprising the first $n$ units of the population.
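The sequential update described above can be sketched in a few lines of Python. This is a simplified illustration and does not attempt to reproduce every detail of Chao (1982): it assumes that the inclusion probability of the newest unit is $n$ times its share of the cumulative auxiliary total, and that the unit leaving the sample is drawn uniformly at random, as in the description given here. The function name and the auxiliary values are hypothetical.

```python
import random
from typing import List, Sequence


def sequential_sample(x: Sequence[float], n: int, rng: random.Random) -> List[int]:
    """One sequential pass over the population; returns 0-based unit indices."""
    if n <= 0 or n > len(x):
        raise ValueError("n must be between 1 and the population size")
    sample = list(range(n))          # initial sample: the first n units
    cumulative = sum(x[:n])
    for k in range(n, len(x)):       # k is the 0-based index of the incoming unit
        cumulative += x[k]
        p_new = n * x[k] / cumulative    # assumed inclusion probability of the incoming unit
        if p_new > 1.0:
            raise ValueError("inclusion probability exceeds one; reorder or rescale x")
        if rng.random() < p_new:
            sample[rng.randrange(n)] = k  # replace a uniformly chosen current member
    return sample


# Made-up auxiliary values; with equal values this reduces to a simple reservoir scheme.
chosen = sequential_sample([1.0] * 10, 3, random.Random(0))
print(chosen)
```

Because each unit is examined once, in population order, the sample is available as soon as the pass over the population ends.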


¹ Yves Berger, Université Libre de Bruxelles, Laboratoire de Méthodologie du Traitement des Données, C.P. 124, Avenue Jeanne, 44, Bruxelles, Belgique,
The Chao plan provides the advantage of being sequential.
In fact, it allows us to select a sample through a simple 
sequential run of the population. The systematic plan is 
another sequential plan that is often used. However, the latter 
is inconvenient in that it induces zero second-order inclusion 
probabilities. We can avoid this problem by randomizing the 
systematic plan. In such a case, the population is ordered at 
random before the sample is selected. This operation 
eliminates in part the problem of zero second-order inclu- 
sion probabilities. As will be seen at the end of this section, 
the Chao plan offers the advantage of not having any zero 
second-order inclusion probabilities. Randomization is there- 
fore not needed for the latter. 

The rejective plan and the Rao-Sampford plan are incon- 
venient in that they are not sequential. In fact, the units are 
selected at random with replacement within the population. If 
a unit is selected twice, we are forced to select a new sample. 
These two plans, although they are more easily understood, 
are more difficult to implement than the Chao plan. 

The following theorem, which is a direct application of the
theorem given by Chao (1982), provides a relation between
the first-order inclusion probability $\pi_{(k;i)}$ of the $i$-th unit of $U_k$
and the first-order inclusion probability $\pi_{(k+1;i)}$ of the $i$-th unit
of $U_{k+1}$.

Theorem 1 


[1 Teena] Tei? for 1<k+ 1: 
Westsi = 


MK tkel) 5 lie alee ile (4) 
where 
= TT 4 
we for k=n, 
Tt 
= (n+1;n+1) 
Rei) a 

= , for k>n+1. (5) 


The second-order inclusion probabilities can be calculated 
iteratively using the following theorem: 


Theorem 2 (Chao, 1982) 
Mi, i) 
{1 Tg Rey * Repl Meri» for t<ji<k, 
Tey U1 —Re-agy) Me-1y , for i<j=k. 
Bethlehem and Schuerhoff (1984) give a sufficient and
necessary condition for the second-order inclusion probabilities
to be strictly positive for a population $U_k$:

$\#\{i : i < \ell \text{ and } \pi_{(\ell;i)} = 1\} \neq n - 1$, for $\ell$ such that $n < \ell \le k$.

Since $\pi_{(\ell;i)} < 1$ for all $i$ and $\ell$ such that $i < \ell \le k$, this
condition is always met. Therefore, within the framework of
this article, we will never have zero second-order inclusion
probabilities.

Moreover, the quantity $\Delta_{(N;i,j)}$ is always negative if we use
the Chao plan (Chao 1982, p. 656). Then the Yates-Grundy
variance offers the advantage of always being positive.


3. APPROXIMATION OF SECOND-ORDER 
INCLUSION PROBABILITIES 


The following theorem provides us with an asymptotic 
expression for second-order inclusion probabilities for the 
Chao plan. 


Theorem 3 
m= ae 
ORL Ol ora oe sif jon Ie 
Po 
Twig) ~ RT Ss ote a 1 
n+l3i n+1sj ake é 
TO) 1 Oia ae aan ify <n +155 (6) 
(n+1;i) ““(n+1;j) 
where Py = M.A and i<j. 


The proof of this theorem can be found in Appendix I. 


Note that this approximation has a different structure
depending on whether $j > n + 1$ or $j \le n + 1$. To avoid this
problem, we will use a plausible condition for the auxiliary
variable so that these two structures will be equivalent. Let us
consider the hypothesis given in the introduction, that the
values of the auxiliary variable show little dispersion for the
first $n + 1$ units of the population. More precisely, we assume
that the auxiliary variable is constant for the first $n + 1$ units,
i.e.,


n ‘ 
Te er Ola ee 
Dion. o 


In this case, 


Masi * Mary) _ n= 


Marti Marisp MS Moneta) 
By using (6), we have the following approximation for
second-order inclusion probabilities

$\pi_{(N;i,j)} \approx \pi_{(N;i)}\, \pi_{(N;j)}\, \frac{n - 1}{n - p_{(j)}}$, if $i < j$;    (7)
where 
ae Mie > he Gehan Ale 
OP |x if jsn+l. (8) 


(n+1;j)? 


4. VARIANCE ESTIMATOR 


Relation (7) leads to the following approximation for
$\Delta_{(N;i,j)}$:

$\Delta_{(N;i,j)} \approx \pi_{(N;i)}\, \pi_{(N;j)} \left[ \frac{n - 1}{n - p_{(j)}} - 1 \right].$    (9)

(2), (7) and (9) provide an asymptotic expression for the
Yates-Grundy estimator:

$\hat V(\hat t_{HT}) \approx \frac{1}{n - 1} \sum_{j \in S_N} \sum_{i \in S_N;\, i<j} (1 - p_{(j)}) \left( \frac{Y_i}{\pi_{(N;i)}} - \frac{Y_j}{\pi_{(N;j)}} \right)^{2}.$    (10)

But this expression tends to underestimate the variance. In
fact, to establish relation (6), we use approximation (19) from
Appendix I. This approximation always implies that:

$\pi_{(N;i,j)} < \pi_{(N;i)}\, \pi_{(N;j)}\, \frac{n - 1}{n - p_{(j)}}.$    (11)

This can easily be verified if we observe that (20) is
obtained from (18) using approximation (19). Inequality (11)
is therefore true for $j > n + 1$. For $j \le n + 1$, it is sufficient to
observe that (21) is also obtained from (19). Inequality (11)
implies that:

$\frac{-\Delta_{(N;i,j)}}{\pi_{(N;i,j)}} > \frac{1 - p_{(j)}}{n - 1},$    (12)

given that $\Delta_{(N;i,j)} < 0$. From (2), (10) and (12), we have
effectively that expression (10) is smaller than the
Yates-Grundy estimator (2).

To overcome this problem of variance underestimation, we
plan to make an adjustment on (9). It is well known that:

$\sum_{i=1;\, i \neq j}^{N} \pi_{(N;i,j)} = (n - 1)\, \pi_{(N;j)}.$    (13)

Approximation (7) does not abide by constraint (13). The
adjustment involves assuming that the $p_{(j)}$ are unknown and
selecting them so as to satisfy (13) for the second-order
probability approximation, i.e.:

i 
wir ths 
2. Mw:i) 5 — » ENDO) 
i=l iat 7 i=j+l = 
(n- 1) Xj 
This constraint can be written as follows 
Jel N iD 
Gee 
De Teen t 2. TN = nae ee PS * (14) 
i=] i=j+1 Pw 


Given that $\sum_{i=1}^{N} \pi_{(N;i)} = n$, constraint (14) is practically
verified if

$p_{(j)} = \pi_{(N;j)}.$    (15)
Relation (16) is plausible given that the difference between
the left and right sides of (16) has as its lower bound


N 
1 
= oe Tn TMaviy ~ Marj] > 


and as its upper bound 


N 


1 
ae Tne [May — Tw; - 


n— 1 i=j+1 


These two bounds are close to zero when the $\pi_{(N;i)}$ show
little dispersion. This means that solution (15) is appropriate
when the $\pi_{(N;i)}$ are small. Furthermore, the greater the value of
j, the closer the two bounds are to zero. Therefore, solution
(15) satisfies (13) all the more closely as j increases. This implies that
our approximation (9) is very good for the pairs (i, j)
(i < j) such that the unit j is located at the end of the
population. In fact, we want approximation (9) to be the best
for the pairs (i, j) whose presence in the sample is
highly probable (i.e., for the pairs (i, j) (i < j) for which $\pi_{(N;i,j)}$
is the largest). It is therefore preferable to place the units
having high first-order inclusion probabilities at the end of the
population.

If we choose to have py = Tyy.i) , we have p, smaller than 
(8). This leads to a larger variance approximation. This 
solution is all the more acceptable as it corresponds to the 
result of the simple plan without replacement. In fact, if we 
replace within (7)Twy.), Ty, ) and p,, by n/N, we obtain 


a eae Paiiete> ne ls 

N(N - 1) 
This expression corresponds, quite clearly, to the result of the 
simple plan without replacement. 

In conclusion, we approximate Avy, ;, through (9) with 
Pw = Tw). We assume that the population is ordered in such 
a way that the units having small $\pi_{(N;i)}$ are located at the
beginning of the population and that the units having large
$\pi_{(N;i)}$ are located at the end of the population. We also assume
that the $\pi_{(N;i)}$ do not show too much dispersion for the first
n + 1 units of the population.


5. COMPARISON WITH OTHER PLANS 


Instead of comparing the second-order inclusion probabilities,
we will compare the quantities $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$, which
are used in calculating the Yates-Grundy variance. We
will examine what these quantities provide for the Chao plan,
the randomized systematic plan (Hartley and Rao 1962), the
rejective plan (Hájek 1964) and the Rao-Sampford plan (Rao
1965, and Sampford 1967).
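To make the role of these quantities concrete, the following sketch computes the Yates-Grundy variance estimate of a Horvitz-Thompson total directly from the ratios $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$. It is an illustration under stated assumptions rather than the author's implementation: the function name `yates_grundy`, the sample values and the use of simple random sampling (for which the ratio is exactly (N - n)/(N(n - 1))) are all hypothetical choices.

```python
from fractions import Fraction
from itertools import combinations


def yates_grundy(y, pi, ratio):
    """Yates-Grundy variance estimate for the Horvitz-Thompson total.

    y     : values observed on the sampled units
    pi    : first-order inclusion probabilities of the sampled units
    ratio : function (i, j) -> -Delta_ij / pi_ij for sampled units i, j
    """
    total = Fraction(0)
    for i, j in combinations(range(len(y)), 2):
        diff = Fraction(y[i]) / pi[i] - Fraction(y[j]) / pi[j]
        total += ratio(i, j) * diff ** 2
    return total


# Illustrative data: simple random sampling of n = 4 units out of N = 50.
N, n = 50, 4
y = [3, 7, 8, 10]
pi = [Fraction(n, N)] * n


def srs_ratio(i, j):
    # Exact value of -Delta_ij / pi_ij under simple random sampling.
    return Fraction(N - n, N * (n - 1))


v = yates_grundy(y, pi, srs_ratio)
```

Under simple random sampling the result coincides with the familiar expression N(N - n)s²/n, where s² is the sample variance; for another design only the `ratio` function needs to change.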
Theorem 4
MS ia 
N; 
Se he , For the Chao plan; 
=k 
“Away _ 1 Ky. ~ Rey. for the randomized 
=< ity Wee eee A systematic plan; 
Te n-1 
(N3t,]) 


for the rejective plan and 


nll ~ Rayolll - Rey.) 
» the Rao-Sampford plan. 


d(n-1) 


The proof of this theorem can be found in Appendix II. 


It is important to note that the proposed approximation for
the randomized systematic plan comes from Deville's
approximation (p. 21) and not from the well-known Hartley-Rao
approximation (1962). We were not able to use the Hartley-Rao
formula because it is based on the asymptotic
hypothesis of n fixed and N → ∞, which differs from that
adopted in this paper.

We observe that if the $\pi_{(N;i)}$ are small, $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$ is
equivalent for the Chao plan and for the systematic plan.
However, we observe that $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$ is always smaller in
the systematic case than it is in the Chao case. This is
certainly due to the fact that the approximation for the
systematic plan underestimates $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$. This can be
confirmed by replacing $\pi_{(N;i)}$ and $\pi_{(N;j)}$ by n/N. We then have

$$\frac{-\Delta_{(N;i,j)}}{\pi_{(N;i,j)}} \approx \frac{N-2n}{N(n-1)}$$

for the randomized systematic plan. This is equivalent to a
simple plan, thus

$$\frac{-\Delta_{(N;i,j)}}{\pi_{(N;i,j)}} = \frac{N-n}{N(n-1)}.$$

We intend to adjust the approximation of $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$ for
the systematic plan by multiplying it by

$$\frac{N-n}{N-2n} = \frac{1-f}{1-2f},$$

where f = n/N is the sampling rate.
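The effect of this adjustment can be checked with a little arithmetic. The sketch below assumes that, with all first-order probabilities equal to n/N, the systematic-plan approximation reduces to (N - 2n)/(N(n - 1)); it then compares the adjusted value with the exact ratio for simple random sampling. The function names are hypothetical and the sizes N = 50, n = 20 are those of the numerical examples.

```python
from fractions import Fraction


def srs_ratio(N, n):
    """Exact -Delta/pi_ij under simple random sampling without replacement."""
    pi_i = Fraction(n, N)
    pi_ij = Fraction(n * (n - 1), N * (N - 1))
    return (pi_i * pi_i - pi_ij) / pi_ij


def systematic_ratio(N, n):
    """Assumed form of the systematic-plan approximation at pi = n/N."""
    return Fraction(N - 2 * n, N * (n - 1))


def adjustment(N, n):
    """Correction factor (1 - f) / (1 - 2f), defined only when N != 2n."""
    f = Fraction(n, N)
    return (1 - f) / (1 - 2 * f)


N, n = 50, 20
exact = srs_ratio(N, n)
unadjusted = systematic_ratio(N, n)
adjusted = adjustment(N, n) * unadjusted
```

With these sizes the unadjusted value is one third of the exact one, which illustrates how severe the underestimation can be when the sampling rate is high; the factor is undefined when N = 2n.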


The approximation of $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$ for the Chao plan is
also of the same magnitude as that of the rejective plan. In
fact, if the $\pi_{(N;i)}$ are small, we have the approximation


n(l- ty. : n[l- Twy,y] 
Se 
[1 - My.) a TNs) 
= 


Therefore, the Yates-Grundy estimator is approximately the
same whether we use the Chao plan, the randomized systematic
plan, the rejective plan or the Rao-Sampford plan, for
large n and small $\pi_{(N;i)}$.


6. NUMERICAL EXAMPLES 


The following two examples correspond to two extreme
cases. In the first example, the $\pi_{(N;i)}$ show little dispersion; in
the second, they show much more dispersion. Let us consider
a small sample of size 20. The population size is 50 so that the
$\pi_{(N;i)}$ are not too small. We have deliberately chosen an unfavourable
situation in order to show that even with a sample of size 20
and a small population, the asymptotic results nevertheless
represent a good approximation.


Example 1 


Let us consider the first-order inclusion probabilities 
represented in Figure 1. 



Figure 1. First-order inclusion probabilities in the case of 
Example 1 


Figure 2 shows, on the Y axis, the true values of
$-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$ for the Chao plan and, on the X axis, the
approximations. We have also drawn the straight line
on which the approximations are equal to the true values. The
closer the points are to this line, the better the
approximations.


Figure 2. Approximations and true values of $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$, in the
case of Example 1


We have a mean error of -0.000569 with a standard deviation
of 0.0015996. This is very small in relation to the order
of magnitude of the approximations. The centre of gravity of
the scatter plot is located at (0.0313; 0.0318). It might seem
surprising that there are fewer points to the left of the centre of
gravity than to the right. This is simply due to the fact that
most of the points to the left of the centre of gravity overlap.
We observe that the pairs (i, j) with i < j such that $\pi_{(N;i,j)}$ is
large correspond to points located on the left. They are the
pairs showing the best approximation. Moreover, there is a
high probability that these pairs are located within the sample
given that $\pi_{(N;i,j)}$ is large. Therefore, our approximate variance
(10) is definitely acceptable.


Example 2 


The first-order inclusion probabilities are given in Figure
3. Here we notice that these probabilities are more dispersed
than in Example 1. Figure 4 provides the true values as well
as the approximations of $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$.




Figure 3. First-order inclusion probabilities in the case of 
Example 2 


Figure 4. Approximations and true values of $-\Delta_{(N;i,j)}/\pi_{(N;i,j)}$, in the
case of Example 2


We have a mean error of -0.006999 with a standard
deviation of 0.006438. The centre of gravity of the scatter plot
is located at (0.02957; 0.036606).

We reach the same conclusion as in Example 1. The 
second example leads to worse approximations. This is 
simply due to the high first-order inclusion probabilities. 


7. CONCLUSION 


The Chao plan provides a number of advantages: (i) it is 
sequential; (ii) the second-order inclusion probabilities are 
positive; and (iii) the Yates-Grundy variance is always 
positive. On the other hand, the second-order inclusion 
probabilities are difficult to calculate. That is why we propose 
to approximate them. We have observed that this approximation
is better when the beginning of the population consists
of units having small $\pi_{(N;i)}$ and the end of the population
consists of units having large $\pi_{(N;i)}$. We have compared our
approximation with other approximations provided for the
randomized systematic plan, the rejective plan and the
Rao-Sampford plan. We have concluded that these approximations
are equivalent if the first-order inclusion probabilities are
small and if the size of the sample is large. The two numerical
examples which close this paper confirm the soundness of
our approximation.


APPENDIX I 


Proof of Theorem 3 


Before proving this theorem, we will demonstrate the 
following two lemmas. 


Lemma 1 
Mi) =P I} Meo 1}; 
where 
: Tei. if t>n+1; 
Pw = ‘ ; : 
Tn +:i) ifs i <n 1: 
Dia ltieleeiiy snl: 
Qa 
ite Oates n+ 1 (17) 
Lemma 2 
Maki) =) I} Teo 21; 
where i <j, 
he 
: muwoFualt~ 2] it jie + 1; 
doy 


Vapi ee ie en i: 


and a; is defined by (17). 


Now, with these two lemmas, we can demonstrate 
Theorem 3. 


Proof of Theorem 3 


Case 1: If j > n + 1, using Lemma 2, we have


1) 7 p) 
Tvs.) ~ ™G-13) mal! a 1) i h ~ Ten 2 


=j+l 
On the basis of Lemma 1, this last expression becomes

and by regrouping certain terms, we obtain

If n is sufficiently large

Berger: Asymptotic Variance for Sequential Sampling Without Replacement With Unequal Probabilities


Finally, on the basis of Lemma 1, this last expression can be 
written: 
Case 2: If j < n + 1, Lemma 2 provides

By using approximation (19), we obtain

On the basis of Lemma 1, we obtain finally
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Manel) Mn+: 
Q.E.D. 
APPENDIX II 
Proof of Theorem 4 


• For the Chao plan, it is sufficient to use (6), (9) and (15).
• For the randomized systematic plan, it is sufficient to use
the approximation of the $\pi_{(N;i,j)}$ given by Deville (p. 21)

7 ix 1 
Tid ~ Mai 7 


; (22) 
This expression is obtained from the hypothesis
Te 
Max, -;.y {aaa - 0. 
n 


This last hypothesis is verified since n → ∞.
• For the rejective plan, using Hájek's result (1964, p. 1508),
we have

$$\frac{-\Delta_{(N;i,j)}}{\pi_{(N;i,j)}} \approx \frac{[1-\pi_{(N;i)}][1-\pi_{(N;j)}]}{d-[1-\pi_{(N;i)}][1-\pi_{(N;j)}]}, \qquad (23)$$

for d → ∞. We note that (23) remains valid for the
Rao-Sampford plan (see Hájek 1981, Theorem 8.2, p. 82). Using
the approximation (Hájek 1964, p. 1521),

$$\left\{d-[1-\pi_{(N;i)}][1-\pi_{(N;j)}]\right\}^{-1} \approx \frac{n}{d(n-1)},$$

we obtain the result of the theorem.
Q.E.D.
Applications of Spatial Smoothing to Survey Data


ANN COWLING, RAY CHAMBERS, RAY LINDSAY and BHAMATHY PARAMESWARAN¹


ABSTRACT 


In this paper we present two applications of spatial smoothing using data collected in a large scale economic survey of 
Australian farms: one a small area and the other a large area application. In the small area application, we describe how the 
sample weights can be spatially smoothed in order to improve small area estimates. In the large area application, we give 
a method for spatially smoothing and then mapping the survey data. The standard method of weighting in the survey is a 
variant of linear regression weighting. For the small area application, this method is modified by introducing a constraint 
on the spatial variability of the weights. Results from a small scale empirical study indicate that this decreases the variance 
of the small area estimators as expected, but at the cost of an increase in their bias. In the large area application, we describe 
the nonparametric regression method used to spatially smooth the survey data as well as techniques for mapping this 
smoothed data using a Geographic Information System (GIS) package. We also present the results of a simulation study 
conducted to determine the most appropriate method and level of smoothing for use in the maps. 


KEY WORDS: Kernel estimation; Mapping survey data; Small area estimation; Survey weighting. 


1. INTRODUCTION 


The Australian Bureau of Agricultural and Resource 
Economics (ABARE) is the applied economic research 
organisation attached to the Department of Primary Industries 
and Energy. Amongst its information gathering activities, 
ABARE conducts annual surveys of selected Australian 
agricultural industries which provide a broad range of 
information on the economic and physical characteristics of 
farm business units. 

The largest survey is the Australian Agricultural and
Grazing Industries Survey (AAGIS), which covers farm
establishments with an estimated value of agricultural operations
(EVAO) of $A22,500 or more in the last agricultural
census that are classified to one of the broadacre industries,
that is, cereal crop production, beef cattle production, and
sheep and wool production. For the last two years, around
1650 farms have been included in the AAGIS sample, which
is stratified by geographic area, industry, and EVAO. The
sample farms are located throughout Australia with a
non-uniform density. The latitude and longitude of the sample
farms (defined in terms of the location of the farm "gate") are
recorded as a regular part of the collection. This knowledge
of the location of the surveyed farms enables the spatial
smoothing techniques described in this paper to be used.

Traditionally, AAGIS estimates have been presented only 
as tables of numbers showing averages for all Australia, each 
state, and industries within states. However, the concern of 
rural industry and government about the combined impact of 
drought in some areas of Australia and the decline in certain 
commodity prices has highlighted the need for timely and 
detailed information on regional trends in farm performance.
In particular, there has been a perceived need for information
which portrays the spatial distribution of farm performance, 
reflecting actual variability in climate and production across 
Australia. 

A highly effective way of presenting information on a 
spatial basis is to map the regional variation in economic 
performance of the surveyed farms. We use a nonparametric 
regression method to spatially smooth the farm level survey 
data, which is then presented in the form of a map. Recent 
improvement in computing power and the availability of high 
quality and affordable GIS packages have made this form of 
presentation a practical alternative to the traditional tabular 
method of presenting survey results. 

Maps have been found to be a successful form of 
exposition for a number of reasons. First, estimates presented 
in a map are easily interpreted; when presented with too many 
tables it is very easy for a client to overlook local variations 
or be “swamped” by numbers. Next, maps make it easy for a 
client to relate the geographic variation in one variable with 
that of another. Finally, a colour map has great visual impact. 

This demand for information on a spatial basis has resulted 
in an increased emphasis on small area estimates. One method 
of small area estimation (which originated naturally from 
smoothing survey data for presentation in maps) is to spatially 
smooth the sample weights. This reduces the variability of the 
small area estimates. 

In Section 2, we examine a method of integrating 
geographical location into ABARE’s survey weighting 
methods in order to make our small area estimates less 
variable. It is applied to sub-regional estimation within two 
Agricultural Regions in Section 3. In Section 4, we describe 
how kernel regression techniques can be used to produce 
maps which give a good indication of the local geographic
variation of a surveyed variable. Two methods of mapping the 
smoothed data are discussed, both of which use ARC/INFO, 
a GIS software package. The results of a simulation study 
comparing various kernel regression methodologies for use in 
ABARE’s maps are summarised in the Appendix. 


2. SMALL AREA ESTIMATION BY 
SPATIALLY SMOOTHING 
SAMPLE WEIGHTS 


The standard method used to compute sample weights at 
ABARE is described in Bardsley and Chambers (1984). It 
rests on the assumption that at some appropriate level of 
aggregation (say, Agricultural Region) the variable Y follows 
a linear model of the form 


$$Y = X\beta + V \qquad (2.1)$$


where Y is the N-vector of values of Y at this level of aggregation,
X is an N × p matrix of values of a set of p benchmark
variables, β is an unknown p-vector of regression coefficients
and V is an N-vector of errors satisfying E(V) = 0 and
var(V) = σ²Ω, where σ is an unknown scale parameter and
Ω is a known N × N diagonal matrix having as its elements the
measure of size of each farm, EVAO, introduced in the
previous section.

Since this model is a multipurpose model, with the same 
set of benchmark variables used for each survey variable, the 
column dimension, p, of X is usually large. Typically, X 
consists of between 3 and 7 variables related to the main 
agricultural commodities produced by farms in the region 
together with dummy variables indicating industry strata 
within the region. Best linear unbiased estimation of the 
population total of a survey variable on the basis of such an 
overspecified model typically results in weights that are 
highly variable and often negative. 

As discussed in Bardsley and Chambers (1984), negative
weights are highly undesirable in a multi-purpose survey like
AAGIS. In particular, such weights can lead to negative
estimates of intrinsically positive quantities. This problem has
been pointed out in the literature a number of times (see, for
example, Deville and Särndal 1992; Bankier, Rathwell and
Majkowski 1992; and Fuller, Loughin and Baker 1994). The
method used at ABARE to ensure strictly positive sample
weights is based on the ridge-type modification to the best
linear unbiased weights suggested by Bardsley and Chambers
(1984).

Given a sample of size n from a particular region, the ridge
weighting approach determines the sample weight vector w
by minimising the mean squared error criterion

$$Q = \lambda^{-1} B^{T} C B + (w - 1)^{T} \omega (w - 1). \qquad (2.2)$$

Here $B = T - x^{T}w$ is a p-vector of benchmark biases,
corresponding to the differences between the (known)
population totals T of the p benchmark variables making up
X and the corresponding survey estimates $x^{T}w$ of these
totals, C is a p × p diagonal matrix of non-negative relative
"costs" associated with these biases, ω is the sample
component of Ω, x is the sample component of X, 1 is an
n-vector of ones and λ is a scaling constant which is chosen
by the survey analyst. The value of w minimising Q is

$$w = 1 + \omega^{-1} x (\lambda C^{-1} + x^{T}\omega^{-1} x)^{-1} (T - x^{T}1). \qquad (2.3)$$


The scale constant λ is called the ridge parameter
associated with these weights. As λ increases from zero, the
sample weights in w move away from their best linear
unbiased values under the model (2.1) (namely, their values
at λ = 0) and become less and less variable. That is, as λ
increases, the variances of the survey estimates based on
these weights decrease. On the other hand, as λ increases,
these estimates become more biased under (2.1), so the
components of B move away from their zero values at λ = 0
(where the sample weights define unbiased estimates under
(2.1)). These components become larger and larger (in
absolute terms) as λ increases.

The survey analyst makes a tradeoff between these two
competing sources of "error" by choosing the smallest value
of λ such that the sample weights in w stabilise at strictly
positive values as close as possible to their best linear
unbiased values under (2.1). This ensures that the components
of B are as small as possible subject to this stability
requirement. At ABARE, the value of λ is chosen so that the
sample weights are at least unity.
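The behaviour of the ridge weights at the two extremes of the ridge parameter can be illustrated numerically. The sketch below is a plain-Python rendering of (2.3), assuming a diagonal ω and strictly positive costs; it is not ABARE's production code, and the function names, the toy benchmark data and the small linear solver are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    M = [list(map(float, row)) + [float(v)] for row, v in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(m):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * p for a, p in zip(M[r], M[c])]
    return [M[r][m] / M[r][r] for r in range(m)]


def ridge_weights(x, omega, T, cost, lam):
    """w = 1 + omega^{-1} x (lam C^{-1} + x' omega^{-1} x)^{-1} (T - x'1).

    x     : n rows, each holding the p benchmark values of a sample farm
    omega : n diagonal entries of the sample component of Omega
    T     : p known benchmark totals
    cost  : p strictly positive diagonal entries of C
    lam   : ridge parameter (lam = 0 gives the best linear unbiased weights)
    """
    n, p = len(x), len(T)
    A = [[sum(x[i][a] * x[i][b] / omega[i] for i in range(n))
          + (lam / cost[a] if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    resid = [T[a] - sum(x[i][a] for i in range(n)) for a in range(p)]
    g = solve(A, resid)
    return [1.0 + sum(x[i][a] * g[a] for a in range(p)) / omega[i]
            for i in range(n)]


# Toy data: an intercept-type benchmark and one size benchmark (hypothetical).
x = [[1.0, 2.0], [1.0, 5.0], [1.0, 9.0], [1.0, 4.0]]
omega = [2.0, 5.0, 9.0, 4.0]
T = [20.0, 110.0]
cost = [1.0, 1.0]

w_blue = ridge_weights(x, omega, T, cost, lam=0.0)   # benchmarks met exactly
w_flat = ridge_weights(x, omega, T, cost, lam=1e9)   # weights pulled to one
```

At λ = 0 the weighted sample totals reproduce the benchmarks, so B = 0; for very large λ the weights collapse towards one and the benchmark biases are largest. Intermediate values trade one against the other, as described above.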

Recent small area estimation research in ABARE has 
focussed on a method of modifying this ridge weighting 
procedure to create sample weights that are less spatially 
variable. We achieve this by modifying the mean squared 
error criterion Q in (2.2) to include a constraint on spatial 
variability, while continuing to regard the elements of the 
variable Y as being independent. 

Let K be an n × n matrix reflecting Euclidean distance
between sample farms, such that K is symmetric and
non-negative, $K_{ii} = 1$ for all i and $K_{ij} \downarrow 0$ as the distance
between farm i and farm j increases. Put u = w - 1. The aim
is then to choose u so that when $K_{ij}$ is large, the difference
between $u_i$ and $u_j$ is small. That is, we seek to minimise a
quantity of the form

$$\frac{1}{2}\sum_{i}\sum_{j} K_{ij}(u_i - u_j)^2 = (u^{(2)})^{T} K 1 - u^{T} K u, \qquad (2.4)$$

where $(u^{(2)})_i = (u_i)^2$. An appropriate modification to the
mean squared error criterion (2.2) leads to minimisation of

$$Q^{*} = \lambda^{-1} B^{T} C B + u^{T}\omega u + (u^{(2)})^{T} K 1 - u^{T} K u.$$

Minimising with respect to u leads to

$$u = \eta^{-1} x (\lambda C^{-1} + x^{T}\eta^{-1} x)^{-1} (T - x^{T}1),$$

provided $\eta^{-1}$ exists, where

$$\eta = \mathrm{diag}(K1) - K + \omega. \qquad (2.5)$$

Clearly, then,

$$w = 1 + \eta^{-1} x (\lambda C^{-1} + x^{T}\eta^{-1} x)^{-1} (T - x^{T}1). \qquad (2.6)$$
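The link between the pairwise smoothness penalty and the matrix in (2.5) can be verified numerically. The sketch below assumes the penalty is normalised with a factor of one half, so that it equals the quadratic form of diag(K1) - K; the matrices, the vector u and the function names are hypothetical illustrations.

```python
def smoothness_penalty(K, u):
    """Pairwise form of the penalty: 0.5 * sum_ij K_ij (u_i - u_j)^2."""
    n = len(u)
    return 0.5 * sum(K[i][j] * (u[i] - u[j]) ** 2
                     for i in range(n) for j in range(n))


def eta_matrix(K, omega):
    """eta = diag(K1) - K + omega, with omega given by its diagonal entries."""
    n = len(K)
    row_sums = [sum(K[i]) for i in range(n)]
    return [[(row_sums[i] + omega[i] if i == j else 0.0) - K[i][j]
             for j in range(n)] for i in range(n)]


def quadratic_form(M, u):
    """u' M u for a square matrix M stored as a list of rows."""
    n = len(u)
    return sum(u[i] * M[i][j] * u[j] for i in range(n) for j in range(n))


# Hypothetical symmetric kernel matrix with unit diagonal, and a diagonal omega.
K = [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
omega = [2.0, 3.0, 5.0]
u = [0.5, -1.0, 2.0]

eta = eta_matrix(K, omega)
lhs = quadratic_form(eta, u)
rhs = sum(o * v * v for o, v in zip(omega, u)) + smoothness_penalty(K, u)
```

The two sides agree, which shows that replacing ω by η in the usual ridge solution is equivalent to adding the spatial penalty to the criterion. The diagonal of η holds ω plus the off-diagonal row sums of K, and each off-diagonal entry is the negative of the corresponding kernel value.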


It can be seen that the modified mean squared error
criterion Q* equally weights the spatial smoothness criterion
given in (2.4) and the term corresponding to the variance of
the prediction error of the sample estimates, $u^{T}\omega u$. As the
scale of K was arbitrarily specified, the comparative
weighting of the two criteria must be modified by "scaling
up" the spatial matrix {diag(K1) - K} by a factor θ in order
to make it comparable in size with the heteroscedasticity
matrix ω, and by adding a parameter α, 0 < α < 1, to the
expression for η in equation (2.5), so that

$$\eta = (1 - \alpha)\,\theta\,\{\mathrm{diag}(K1) - K\} + \alpha\,\omega.$$


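These spatially smoothed sample weights can be derived
in a second way, providing deeper insight into how they
should be interpreted. This follows from noting that

$$\eta = \begin{pmatrix}
\sigma_1^2 + \sum_{m \neq 1} K_{1m} & -K_{12} & \cdots & -K_{1n} \\
-K_{21} & \sigma_2^2 + \sum_{m \neq 2} K_{2m} & \cdots & -K_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
-K_{n1} & -K_{n2} & \cdots & \sigma_n^2 + \sum_{m \neq n} K_{nm}
\end{pmatrix}$$

can be expressed as η = S R S, where S is a diagonal matrix
with $S_{ii} = (\sigma_i^2 + \sum_{m \neq i} K_{im})^{1/2}$, and R is a correlation matrix
with

$$R_{ij} = \begin{cases}
1 & \text{if } i = j, \\
-K_{ij}\left\{\left(\sigma_i^2 + \sum_{m \neq i} K_{im}\right)\left(\sigma_j^2 + \sum_{m \neq j} K_{jm}\right)\right\}^{-1/2} & \text{if } i \neq j.
\end{cases}$$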
Thus the spatially smoothed sample weights can alternatively
be derived as ridge-type regression weights based on
the assumption that the variable Y follows a linear model of
the form (2.1), with V redefined as satisfying E(V) = 0,
$\mathrm{var}(V_i) = \sigma_i^2 + \sum_{m \neq i} K_{im}$, and $\mathrm{cov}(V_i, V_j) = -K_{ij}$ for i ≠ j.
The usual ridge weighting procedure then leads directly to
(2.6) with η defined by (2.5). Note that under this implied
model neighbouring farms are negatively correlated.

This second method of derivation shows clearly that the
introduction of spatial smoothness for the survey weights is
at odds with standard concepts of statistical efficiency as far
as estimation at the aggregate level is concerned. Since the
spatial correlation between neighbouring farms will typically
be positive, efficient survey estimation at the aggregate level
will involve weighting based on (2.3) with ω replaced by a
non-diagonal variance/covariance matrix reflecting this
positive spatial correlation. These are not the weights that
result when one imposes a spatial similarity constraint.
Consequently, one could expect that such "large area
efficient" weights would tend to be more dissimilar for
neighbouring farms than they would be for farms that are far
apart. That is, there is a price to pay in weighting: if less
variable aggregate level estimates are required, then this tends
to lead to more variable small area estimates. Conversely, if
(2.6) is adopted as the method of weighting because of its
desirable small area properties, then it can be expected that
aggregate level estimates obtained by summing these small
area estimates will be less efficient.

The spatially smooth sample weights (2.6) have been
implemented using

$$K_{ij} = \exp(-d\,\lVert z_i - z_j \rVert), \qquad (2.7)$$

where $\lVert z_i - z_j \rVert$ is the distance between farm i and farm j and
d is a constant controlling the radius of the circle around the i-th
farm within which spatial smoothing is applied. The smaller
the value of d, the larger the radius of spatial smoothing. At
present, the "scaling up" constant θ is computed as the ratio
of the determinants of the K and ω matrices, raised to the
power $n^{-1}$. An empirical evaluation of this method is
described in the following section.
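A short sketch shows how the constant d governs the reach of the kernel (2.7). The farm coordinates and their distance units are hypothetical (the paper does not state the units used), as is the function name `spatial_kernel`; the two values of d are those used in the evaluation that follows.

```python
import math


def spatial_kernel(coords, d):
    """K_ij = exp(-d * ||z_i - z_j||) for a list of planar farm locations."""
    return [[math.exp(-d * math.dist(a, b)) for b in coords] for a in coords]


# Three hypothetical farm locations; farms 0 and 1 are 50 units apart.
farms = [(0.0, 0.0), (30.0, 40.0), (300.0, 400.0)]

weak = spatial_kernel(farms, 0.05)     # short smoothing radius
strong = spatial_kernel(farms, 0.005)  # long smoothing radius
```

For the two farms that are 50 units apart the kernel value is exp(-2.5) when d = 0.05 and exp(-0.25) when d = 0.005, so the smaller d ties the weights of these farms together much more strongly.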


3. AN APPLICATION OF SPATIALLY 
SMOOTHED SAMPLE 
WEIGHTING 


Initial results from an evaluation of the first method of
spatially smoothed ridge weighting described in the previous
section are set out in Tables 1 to 3. These results are for two
Agricultural Regions. The first, Region A, is in New South
Wales. In spatial terms, this region is relatively homogeneous,
being located in the southwestern corner of the state. The
principal agricultural activities are wheat and rice production
and wool and lamb production. The second, Region B, is in
Western Australia. This region is more spatially heterogeneous,
ranging from established cropping and wool production
farms in the central west of the state to much larger
livestock and cropping farms on marginal farming land in the
south east of the state. The principal agricultural activities are
wheat and legumes production and wool production.

Six variations of the spatially smoothed ridge weights (2.6)
with K given by (2.7) were used in the evaluation, defined by
values of d = 0.05 (weak spatial effects) and d = 0.005 (strong
spatial effects), and values of α = 0.9 (most emphasis on the
standard ridge weights), α = 0.5 (equal emphasis on standard
ridge weights and spatially smooth weights) and α = 0.1 (most
emphasis on spatially smooth weights).
Table 1
Values (in relative percentage terms) of the biases associated
with estimation of the benchmark variables corresponding to
the principal agricultural commodities produced in Region A
(sample size n = 101 farms) and Region B
(sample size n = 85 farms) using the standard ridge weights (2.3)
and the spatially smooth ridge weights (2.6)


Region A                              Wheat    Sheep     Rice
Standard ridge weights                -0.50      5.0     13.0
Spatially smoothed ridge weights
  d = 0.05     α = 0.9                -0.50      4.6     UNS)
               α = 0.5                -0.46      4.7     12.4
               α = 0.1                 0.07      6.2     17.4
  d = 0.005    α = 0.9                -0.40      4.9       4)
               α = 0.5                 0.80      8.9     28.0
               α = 0.1                 9.20     25.0     60.0

Region B                              Wheat    Sheep  Legumes
Standard ridge weights                 0.43   = 235)     1.49
Spatially smoothed ridge weights
  d = 0.05     α = 0.9                 0.42    -1.16     1.37
               α = 0.5                 0.44    -1.14     1.40
               α = 0.1                 0.69  = L925)     2.53
  d = 0.005    α = 0.9                 0.50    -1.20     1.68
               α = 0.5              ilesiil     1.14     9.73
               α = 0.1                26.57    19.61    45.46


Table 1 shows the relative biases associated with estimation
of the population totals of the main commodity related
benchmarks for each region under these different weighting
systems, as well as the corresponding biases associated with
the standard ridge weights. The increase in these biases as the
amount of spatial smoothing in the weights is increased is
evident. Since these production benchmarks are positively
correlated with most of the economic variables measured in
the survey, these benchmark biases can be expected to be
translated into a corresponding upward bias in survey
estimates based on these weights.

Figures 1 to 4 show how the difference between the smoothed
weights and the standard ridge weights for the two "extreme"
combinations of α and d in both regions changes as the size
(measured in terms of the logarithm of the estimated value of
agricultural operations, or log(EVAO)) of the sample farms
changes.

Observe that for relatively strong spatial smoothing
(Figures 1 and 3), the effect of smoothing is to increase the
weights of most of the larger sample farms, while dramatically
decreasing the weights of a small number of smaller
sample farms. Weak spatial smoothing (Figures 2 and 4)
changes the weights much less, and there is little relationship
between the size of the farm and the direction of weight
change. Consequently, an upward shift in survey estimates
for these regions could be expected with the introduction of
strongly spatially smoothed sample weights. Given the
increased positive biases indicated in Table 1, this upward
shift would be expected to be essentially due to the introduction
of a positive bias in these estimates.

Figure 1. Difference between smoothed weight with α = 0.1 and
d = 0.005 and standard ridge weight, Region A

Figure 2. Difference between smoothed weight with α = 0.9 and
d = 0.05 and standard ridge weight, Region A

Figure 3. Difference between smoothed weight with α = 0.1 and
d = 0.005 and standard ridge weight, Region B

Figure 4. Difference between smoothed weight with α = 0.9 and
d = 0.05 and standard ridge weight, Region B

Is this increased bias compensated for by a lower standard 
error? To evaluate this question, survey estimates and 
estimated standard errors were computed for a key financial 
variable, total cash costs. These estimates are set out in 
Table 2 (Region A) and Table 3 (Region B). Estimates are 
provided both for each region and for small areas within each 
region, denoted SR-i in the table, with the index i ranging 
between 1 and 6 for Region A and between 1 and 7 for 
Region B. 


Table 2
Estimates (with corresponding estimated standard errors in
parentheses) of the average value of Y = total cash costs
in subregions SR-1 to SR-6, making up Region A
(sample size n = 101 farms), using the standard ridge
weights (2.3) and the spatially smooth ridge weights (2.6)


                              Spatially smoothed ridge weights
             Standard             d = 0.05                         d = 0.005
              weights    α = 0.9    α = 0.5    α = 0.1    α = 0.9    α = 0.5    α = 0.1

SR-1          100,618    100,453    101,297    107,263    102,059    112,635    135,419
             (24,551)   (24,511)   (23,906)   (20,487)   (23,474)   (18,923)   (18,011)

SR-2          115,320    115,417    116,002    120,362    116,917    126,165    153,707
             (26,754)   (26,661)   (26,448)   (25,637)   (26,423)   (25,990)   (27,975)

SR-3          167,524    167,453    167,486    168,257    167,709    170,781    187,683
             (28,479)   (28,467)   (28,473)   (28,426)   (28,175)   (26,471)   (24,211)

SR-4          182,940    180,317    177,838    163,556    176,257    174,077    192,296
            (106,471)  (105,485)  (101,012)   (74,418)   (97,823)   (69,109)   (43,651)

SR-5          132,050    132,083    132,389    134,786    132,490    136,369    151,046
             (25,089)   (25,096)   (25,154)   (25,475)   (25,173)   (24,410)   (23,110)

SR-6          132,493    132,184    133,204    141,623    133,763    147,652    192,781
             (44,385)   (44,546)   (44,757)   (46,736)   (45,078)   (46,953)   (53,105)

Region A      134,114    133,807    134,141    137,080    134,506    142,040    166,432
             (15,691)   (15,655)   (15,426)   (13,845)   (15,199)   (13,494)   (12,815)
Table 3
Estimates (with corresponding estimated standard errors in
parentheses) of the average value of Y = total cash costs
in subregions SR-1 to SR-7, making up Region B
(sample size n = 85 farms), using the standard ridge
weights (2.3) and the spatially smooth ridge weights (2.6)


                                 Spatially smoothed weights
             Standard             d = 0.05                         d = 0.005
              weights    α = 0.9    α = 0.5    α = 0.1    α = 0.9    α = 0.5    α = 0.1

SR-1          183,194    183,262    183,528    186,151    184,287    195,138    257,652
             (64,851)   (64,325)   (64,051)   (64,967)   (64,132)   (69,859)   (59,518)

SR-2          261,952    261,487    261,119    261,182    261,938    276,912    331,805
             (70,989)   (70,601)   (70,502)   (73,131)   (70,723)   (79,751)   (67,356)

SR-3          113,499    113,441    113,742    116,847    114,631    125,525    157,007
             (30,304)   (30,289)   (30,255)   (30,731)   (30,377)   (31,507)   (32,500)

SR-4          242,220    242,182    242,208    242,221    242,163    242,439    250,871
             (26,160)   (25,671)   (26,159)   (26,160)   (26,154)   (24,244)   (24,836)

SR-5          134,524    134,970    135,700    139,122    134,734    131,448    148,629
             (32,420)   (32,528)   (32,432)   (30,607)   (32,202)   (27,867)   (27,942)

SR-6          176,540    176,977    175,708    163,241    172,076    148,434    171,856
             (60,377)   (60,703)   (59,214)   (46,361)   (55,925)   (36,218)   (39,527)

SR-7          205,287    205,644    205,433    202,039    204,519    194,998    219,959
             (44,137)   (44,008)   (43,963)   (44,044)   (43,972)   (45,434)   (51,690)

Region B      176,283    176,342    176,397    176,822    176,294    179,998    216,445
             (19,039)   (18,869)   (18,874)   (18,213)   (18,511)   (18,540)   (17,099)


It is seen that, in general, the answer to the question posed 
above is yes. The estimated standard errors of the survey 
estimates decrease as the degree of spatial smoothness of the 
weights increases (from left to right across the tables). 
However, as expected, the estimates themselves also increase 
in size, becoming more and more positively biased. Overall, 
the gain due to reduced standard error seems to cancel out the 
increase in bias, except for the heaviest spatial smoothing 
(α = 0.1, d = 0.005). In this latter case the increase in bias
outweighs the reduction in standard error. The choice α = 0.1
and d = 0.05 seems a good compromise, leading to reasonable 
(but not spectacular) bias-variance tradeoffs in Region A, and 
little change in the estimates in Region B. 


4. ESTIMATION AND MAPPING 
OF LOCAL AVERAGES 


A survey data map is a two-dimensional surface which 
estimates the spatial mean function of the survey variable in 
the population. In practice, such a map is obtained by 
applying a nonparametric regression technique to the 
weighted unit record data obtained in the survey. 

At ABARE, we use kernel regression (a nonparametric 
technique) to produce maps which show the spatial varia- 
tion of the estimated spatial mean function surfaces of key 
survey variables. These surfaces are obtained by replacing 
the observed sample values of these variables by locally 
weighted averages. In addition, for each local average map, a
corresponding map is produced which shows an estimate of
the local variability of the variable of interest. We give below
a brief outline of the technique: for clarity of exposition we
deal only with the univariate case. See Ruppert and Wand
(1994), Wand and Jones (1995, p. 140), and the references
therein, for discussion of the multivariate case.

We assume that the finite population is generated as an iid
sample $\{(Z_i, Y_i),\ i = 1, \ldots, N\}$ from a superpopulation, where
$Y_i$ is the value of a response variable Y observed at location $Z_i$.
We suppose that the observations follow the model

$$Y_i = m(Z_i) + \varepsilon_i, \qquad i = 1, \ldots, N,$$

where $m(z) = E(Y \mid Z = z)$ is the conditional mean of Y given
Z, and the $\varepsilon_i$ are independent random variables with zero
mean and variance $\sigma^2(z)$. Suppose that the error terms $\varepsilon_i$ are
independent of the process by which the sample is selected,
so that the sample values $\{(Z_i, Y_i),\ i = 1, \ldots, n\}$ follow the
same model, and write f for the density of $Z_1, \ldots, Z_n$.
A natural choice for the local average at any point z is then
the mean of the values of the response variable for those
observations with locations close to z, since observations
from points far away will tend to have very different mean
values. The local average is defined as a weighted mean

$$\hat{m}(z) = n^{-1} \sum_{i=1}^{n} W_i(z)\, Y_i,$$

where the weights $\{W_i(z)\}$ depend on the locations $\{Z_i\}$ of
the sample observations, and $\hat{m}(z)$ estimates $m(z)$.

The weights are constructed using a function K known 
as the kernel, which is continuous, bounded, symmetric 
and integrates to one. Various weight sequences have 
been proposed: the traditional Nadaraya-Watson weights 
(Nadaraya 1964 and Watson 1964) are 


$$W_i(z) = \frac{h^{-1} K\{(z - Z_i)/h\}}{(nh)^{-1} \sum_{j=1}^{n} K\{(z - Z_j)/h\}},$$

where h is a scale factor known as the bandwidth. The kernel
function K gives an observation close to z relatively more
influence on the local average at this location than it gives to
an observation further from z.
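
As an illustration only, the fixed-bandwidth Nadaraya-Watson local average can be sketched in a few lines of Python for the univariate case. The Gaussian kernel and the function names `gaussian_kernel`, `nw_weights` and `local_average` are assumptions made for this sketch; they are not taken from ABARE's smoothing program.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: continuous, bounded, symmetric, integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_weights(z, locations, h, kernel=gaussian_kernel):
    """Nadaraya-Watson weights W_i(z) at the point z for bandwidth h (hypothetical helper)."""
    n = len(locations)
    scaled = [kernel((z - z_i) / h) / h for z_i in locations]  # h^{-1} K{(z - Z_i)/h}
    denom = sum(scaled) / n                                    # (nh)^{-1} sum_j K{(z - Z_j)/h}
    if denom == 0.0:
        raise ValueError("no kernel mass near z; increase the bandwidth")
    return [s / denom for s in scaled]


def local_average(z, locations, responses, h, kernel=gaussian_kernel):
    """Local average at z: the mean of the responses weighted by W_i(z)."""
    weights = nw_weights(z, locations, h, kernel)
    total = sum(w_i * y_i for w_i, y_i in zip(weights, responses))
    return total / len(locations)
```

With this normalisation the weights sum to n, so the local average of a constant response is that constant, whatever the bandwidth.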

Where observations are sparse, a fixed-bandwidth window 
may contain few points and the corresponding estimator may 
therefore have a very high variance. This may be avoided 
by using the k-nearest-neighbour method in which a different 
bandwidth is used at each estimation point z. The band- 
width at z is the distance to the k-th nearest neighbour of z, so 
that there are always exactly k points in the bandwidth 
window. Let $h_k$ be the distance between z and its k-th
nearest neighbour. The k-nearest-neighbour Nadaraya-Watson
weights are

$$W_{ki}(z) = \frac{h_k^{-1} K\{(z - Z_i)/h_k\}}{(n h_k)^{-1} \sum_{j=1}^{n} K\{(z - Z_j)/h_k\}}.$$
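
The k-nearest-neighbour variant can be sketched in the same way. The sketch below assumes a compact-support (Epanechnikov) kernel so that only points inside the bandwidth window contribute; the kernel choice and the names `epanechnikov`, `knn_bandwidth` and `knn_local_average` are illustrative assumptions, not the authors' implementation.

```python
def epanechnikov(u):
    """Compact-support kernel: zero outside [-1, 1], integrates to one."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def knn_bandwidth(z, locations, k):
    """Distance h_k from z to its k-th nearest sample location."""
    if not 1 <= k <= len(locations):
        raise ValueError("k must lie between 1 and the sample size")
    return sorted(abs(z - z_i) for z_i in locations)[k - 1]


def knn_local_average(z, locations, responses, k, kernel=epanechnikov):
    """Local average at z using the location-dependent bandwidth h_k."""
    h_k = knn_bandwidth(z, locations, k)
    if h_k == 0.0:
        raise ValueError("k-th neighbour coincides with z; use a larger k")
    num = 0.0
    den = 0.0
    for z_i, y_i in zip(locations, responses):
        k_i = kernel((z - z_i) / h_k)
        num += k_i * y_i
        den += k_i
    if den == 0.0:
        raise ValueError("all kernel weights are zero; use a larger k")
    # The factors 1/h_k and 1/n cancel between numerator and denominator,
    # so this ratio equals the weighted mean defined by the k-NN weights.
    return num / den
```

Note that with this particular kernel the k-th neighbour sits on the edge of the window and receives zero weight, so in practice k is chosen somewhat larger than the number of points intended to influence the estimate.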


We show in Table 4 the asymptotic mean squared error 
(MSE) properties of the usual (fixed-bandwidth) and 
k-nearest-neighbour estimators as given in Härdle (1990,
p. 46). 


Table 4
Asymptotic bias and variance of Nadaraya-Watson estimators;
$c_K = \int K^2(u)\,du$, $d_K = \int u^2 K(u)\,du$

            Fixed-bandwidth                                 k-nearest-neighbour

Bias        $h^2\,\frac{(m''f + 2m'f')(x)}{2f(x)}\,d_K$     $(k/n)^2\,\frac{(m''f + 2m'f')(x)}{8f^3(x)}\,d_K$

Variance    $\frac{\sigma^2(x)}{nhf(x)}\,c_K$               $\frac{2\sigma^2(x)}{k}\,c_K$


Clearly, the bias of the estimated regression function can
be reduced by using a smaller bandwidth h (or a smaller number
of nearest neighbours k), but this leads to a noisy estimate
with local detail masking global features of the curve ($\hat{m}$ has
high variance). If h (or k) is large, $\hat{m}$ is smoother but the global
features are dampened ($\hat{m}$ has high bias and low variance).
The bias, then, can only be reduced at the expense of variance
and vice versa, with the bandwidth h determining the ratio of
(squared) bias to variance.
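
The tradeoff can be made explicit for the fixed-bandwidth estimator. The short derivation below is an illustrative aside based on the standard asymptotic expressions for the Nadaraya-Watson estimator; the symbols $B(x)$, $V(x)$, $\mathrm{AMSE}$ and $h_{\mathrm{opt}}$ are introduced here for convenience, and $B(x) \neq 0$ is assumed.

```latex
\begin{align*}
\mathrm{AMSE}(h) &= B^2(x)\,h^4 + \frac{V(x)}{nh},
  \qquad B(x) = \frac{(m''f + 2m'f')(x)}{2f(x)}\,d_K,
  \quad V(x) = \frac{\sigma^2(x)}{f(x)}\,c_K, \\[4pt]
\frac{d}{dh}\,\mathrm{AMSE}(h) &= 4B^2(x)\,h^3 - \frac{V(x)}{nh^2} = 0
  \;\Longrightarrow\;
  h_{\mathrm{opt}} = \left\{\frac{V(x)}{4B^2(x)\,n}\right\}^{1/5}, \\[4pt]
\frac{B^2(x)\,h_{\mathrm{opt}}^4}{V(x)/(n h_{\mathrm{opt}})}
  &= \frac{n\,B^2(x)\,h_{\mathrm{opt}}^5}{V(x)} = \frac{1}{4}.
\end{align*}
```

At the pointwise optimum the squared bias is therefore one quarter of the variance, and both terms are of order $n^{-4/5}$.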

In reality, the survey design and the spatial distribution of 
a survey variable Y will not be independent, so simple local 
averages for Y derived from the sample data will be 
misleading as estimates of the local population means of this 
variable. To overcome this problem the kernel weights are 
multiplied by the survey weights to get the final smoothing 
weights used for calculating the local average. This is 
equivalent to estimating the local population mean m(z) of Y 
under the assumption that it is locally linear in the same 
benchmark variables as those used to model the overall 
population mean of Y. 
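
A minimal sketch of this combination, assuming a Gaussian kernel and normalisation by the sum of the final smoothing weights (the normalisation and the function names are assumptions of the sketch, not a statement of the production code):

```python
import math


def gaussian_kernel(u):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def survey_weighted_local_average(z, locations, responses, survey_weights, h,
                                  kernel=gaussian_kernel):
    """Local average at z with final weights = survey weight x kernel weight."""
    num = 0.0
    den = 0.0
    for z_i, y_i, w_i in zip(locations, responses, survey_weights):
        s_i = w_i * kernel((z - z_i) / h)  # final smoothing weight for unit i
        num += s_i * y_i
        den += s_i
    if den == 0.0:
        raise ValueError("no weighted kernel mass near z; increase the bandwidth")
    return num / den
```

For two farms equidistant from z with responses 0 and 4 and survey weights 1 and 3, the kernel factors cancel and the local average is 3 rather than the unweighted value 2, which shows how the survey weights shift the estimate towards the units that represent more of the population.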

A wide array of alternative kernel smoothing procedures
has been discussed in the literature. As well as various
sequences of smoothing weights $\{W_i\}$, there are different
types of bandwidths, and several automatic bandwidth selec- 
tion methods. A simulation study was therefore conducted to 
determine the most appropriate kernel methodology for use in 
ABARE’s maps. This is described in the Appendix. 

Uncertainty about the estimate of the spatial mean derived 
via kernel-based spatial smoothing can be represented by 
mapping the local variability of the variable of interest. Areas 
of high local variability correspond to areas where the map of 
the mean function is less precise and vice versa for areas of 
low local variability. 

Figure 5. Polygon map of farm business profit in 1991-1992, all
broadacre farms ($). Key: no data; less than -27000;
-27000 to -25000; -25000 to -24000; -24000 to -21000;
greater than -21000.

Figure 6. Polygon map of interexpectile range of farm business
profit in 1991-1992, all broadacre farms ($). Key: no data;
less than 29000; 29000 to 32000; 32000 to 42500;
42500 to 46500; greater than 46500.

The usual method of determining confidence regions for
a kernel curve estimate is the bootstrap; see Härdle (1990),
Hall (1992), and references therein. However, for
computational efficiency, we use the expectiles (Newey and
Powell 1987) of the spatial distribution of Y to describe this
local variability. An expectile bears the same relationship to
the mean as the corresponding quantile does to the median. In
particular, the difference between the 75th and 25th expectiles
of a distribution is a measure of the spread of the distribution
in the same way as the interquartile range is a measure of this
spread. The smoothing program contains a module for
nonparametric M-quantile regression (Breckling and Chambers
1988) which is used to fit a smooth surface to the expectiles
of the Y-distribution at any location. The difference between
the smoothed 75th and 25th expectile surfaces (the smooth
expectile analogue of the interquartile range) is then mapped
to show areas of high and low variability in the data.
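
To make the expectile idea concrete, the sketch below computes an ordinary (non-spatial) sample expectile by iterated asymmetrically weighted averaging. It is not the nonparametric M-quantile regression module referred to above; the function names, the convergence tolerance and the reading of the "75th expectile" as asymmetry parameter 0.75 are assumptions of this sketch.

```python
def expectile(values, tau, tol=1e-10, max_iter=200):
    """Sample expectile: the value mu solving
    sum_i w_i (y_i - mu) = 0, with w_i = tau if y_i > mu, else 1 - tau."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    mu = sum(values) / len(values)  # the 50th expectile is the mean
    for _ in range(max_iter):
        weights = [tau if y > mu else 1.0 - tau for y in values]
        new_mu = sum(w * y for w, y in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) <= tol:
            return new_mu
        mu = new_mu
    return mu


def interexpectile_range(values, lower=0.25, upper=0.75):
    """Spread measure analogous to the interquartile range."""
    return expectile(values, upper) - expectile(values, lower)
```

For the values 1, 2, 3 and 10 the mean is 4, the 75th and 25th expectiles are 6 and 2.75, and the interexpectile range is 3.25. Replacing the plain sums by kernel-weighted sums around a location z would give a local version of the same quantity.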

Not surprisingly, this smooth interexpectile range tends to 
be highest in areas where the farms are sparsely located and 
the farm-to-farm variability in Y is therefore highest. The 
interexpectile range map corresponding to Figure 5 is shown 
in Figure 6. Note that these smoothed interexpectile range 
maps provide similar information to confidence bands at any 
particular point on the map. However, they do not have the 
same repeated sampling interpretation as confidence intervals, 
and hence should be treated as guides to, rather than measures 
of, the uncertainty associated with a particular map.

For confidentiality reasons, care must be taken when
mapping the smoothed data for publication to ensure that the 
locations of the surveyed farms are not revealed. Another 
requirement is output quality compatible with desktop 
publication packages. Two procedures for generating the final 
maps that satisfy these requirements have been developed 
using ARC/INFO. 

In the first method, a Thiessen polygon is constructed 
around each farm. The polygon defines the area closer to that 
farm than to any other farm. The farm location is not in the 
centre of its polygon, and the polygon shape does not 
resemble the shape of the farm, so the polygons conceal the 
locations of the survey farms, as shown in Figure 7. The 
whole of each polygon is coloured according to the smoothed 
value of Y at the farm location in that polygon. Usually ten 
colours are used in each map and the estimated population 
deciles of the smoothed data are used as boundaries for the 
colour area. The maps shown in this paper are black-and- 
white analogues of these colour maps. 
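
The colouring rule can be sketched without any GIS software: a point belongs to the Thiessen polygon of its nearest farm, and the polygon's colour class is the decile band containing that farm's smoothed value. The sketch below is not the ARC/INFO procedure; it treats coordinates as planar, takes the nine decile boundaries as given, and uses hypothetical function names.

```python
import math


def nearest_farm(point, farms):
    """Index of the farm whose Thiessen polygon contains `point`."""
    return min(range(len(farms)), key=lambda i: math.dist(point, farms[i]))


def decile_class(value, boundaries):
    """Colour class 0..9: the number of decile boundaries not exceeding `value`."""
    return sum(1 for b in boundaries if value >= b)


def colour_class(point, farms, smoothed, boundaries):
    """Colour class of the polygon that contains `point`."""
    return decile_class(smoothed[nearest_farm(point, farms)], boundaries)
```

For example, with farms at (0, 0), (10, 0) and (0, 10), smoothed values 5, 50 and 95, and boundaries 10, 20, ..., 90, the point (9, 1) falls in the second farm's polygon and is assigned class 5.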


Figure 7. Thiessen polygons constructed around selected ABARE 
survey farms. Farm location is shown as a small square 
within each polygon 


In the second method, smoothed values on a dense 
rectangular grid are used in place of smoothed values at the 
farm locations, and a further minor interpolation of the data 
is carried out in ARC/INFO. A continuous 3-dimensional 
surface which passes through the smoothed values at the grid 
points is built in two steps. As a first approximation, a faceted 
surface of triangles obtained by Delaunay triangulation is
constructed, and then a bivariate fifth degree polynomial is 
fitted within each triangle using Akima’s algorithm (Akima 
1978). The resulting continuous surface is then contoured 
using the estimated population deciles. Figure 8 is an example. 

In this second method of presentation, the locations of the
survey farms are not used in any way, thereby completely
concealing the location of each survey farm. It also gives
smooth contours, and the result is not as patchy as the
polygon based map. Moreover, it is preferred by ABARE’s
graphics staff because it reduces the number of areas to be
separately coloured and has lower storage requirements,
enabling the maps to be more readily manipulated in desktop
publishing packages. Its disadvantage is that it uses more
computing time in the ARC/INFO stage.

Figure 8. Contour map of farm business profit in 1991-1992, all
broadacre farms ($). Key: no data; less than -27000;
-27000 to -25000; -25000 to -24000; -24000 to -21000;
greater than -21000.

Since the above procedures interpolate across all of 
Australia, including areas where there is no agricultural 
activity, the final stage of the map production in ARC/INFO 
is the “blanking out” of those areas of Australia where there 
are few or no farms involved in the particular broadacre 
industry represented by the map. As Figure 9 shows, different 
areas are blanked out for different industries. 


Figure 9. Polygon map showing expected change in wool
production, 1991-92 to 1992-93, farms with 100 or more
sheep in 1991-92 (kg). Key: no data; less than -180;
-180 to 0; 0 to 135; 135 to 250; greater than 250.


5. DISCUSSION 


In this paper we have demonstrated that when survey data 
has a spatial dimension, as in the case of the AAGIS, spatial 
smoothness concepts may be useful to the analyst. The 
concept can be used to modify survey weights to ensure less 
variable small area survey estimates. It may also be used to 
smooth the data along spatial dimensions before mapping the 
spatial mean function. 


Because we describe mapping in this paper, we have only 
considered smoothing along spatial dimensions. However, it 
is clearly possible to use the same techniques to smooth along 
other dimensions. Thus, if there is reason to expect the 
presence of strong serial correlation when the underlying 
population is ordered according to some variable, then one 
can consider applying the methods described in this paper to 
mapping the “change” in the survey variables relative to the 
change in this variable. In doing so, it should be noted that 
such “maps” are nothing more than nonparametric estimates 
of the conditional means of the survey variables given this 
“ordering” or “smoothing” variable. The analyst should, how- 
ever, remember the “curse of dimensionality”: the effective 
sample size drops sharply with each additional smoothing 
variable used in these nonparametric techniques. 

Finally, in mapping the survey data, we have used kernel- 
based estimation techniques. However, spline smoothing, or 
even parametric methods could also be used. We regard the 
choice of smoothing technology as somewhat subjective and 
purpose specific, as there are no definitive objective reasons 
for preferring one method over another.


APPENDIX


In the last few years a number of optimality properties 
have been established for the locally-linear kernel weights 
(see for example Wand and Jones (1995) and references 
therein). We therefore compared Nadaraya-Watson (NW) and 
locally-linear (LL) weight sequences using fixed (FBW) and 
k-nearest-neighbour (NN) bandwidths with each weight 
sequence. For each of these combinations, we selected the 
bandwidth using least-squares cross-validation (CV), and an 
ad hoc method (detailed in the last paragraph of this section) 
aimed at reducing the speckledness of a map (SF). 

Two criteria were used to evaluate the performance of each 
methodology. The first, MSE, is the obvious statistical 
criterion for assessing a biased estimator. The second 
criterion is more ABARE specific. As estimates are produced 
both in tables (by State) and in maps, the impression of the 
state average given by the map should be close to the 
tabulated value. We therefore used a weighted sum of the 
squared differences between the state averages of the raw and 
smoothed survey data (SB²). This measure was also calculated
at regional rather than state level (RB²; there are between one
and nine regions in each state).
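
One possible reading of this criterion is sketched below. The weighting of each state (or region) by its share of the total survey weight, and the use of survey-weighted averages within each group, are assumptions of the sketch; the text specifies only that a weighted sum of squared differences was used. The function names are hypothetical.

```python
from collections import defaultdict


def group_means(groups, values, weights):
    """Survey-weighted mean of `values` within each group."""
    num = defaultdict(float)
    den = defaultdict(float)
    for g, v, w in zip(groups, values, weights):
        num[g] += w * v
        den[g] += w
    return {g: num[g] / den[g] for g in num}


def squared_difference_measure(groups, raw, smoothed, weights):
    """Weighted sum over groups of (smoothed mean - raw mean)^2."""
    raw_means = group_means(groups, raw, weights)
    smooth_means = group_means(groups, smoothed, weights)
    total = sum(weights)
    share = defaultdict(float)
    for g, w in zip(groups, weights):
        share[g] += w / total  # assumed group weight: share of total survey weight
    return sum(share[g] * (smooth_means[g] - raw_means[g]) ** 2 for g in raw_means)
```

Passing state labels as `groups` gives a state-level measure and passing region labels gives the regional one; the measure is zero when smoothing leaves every group average unchanged.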

Data were generated at the survey farm locations using
three smooth functions with varying degrees of smoothness
(measured by $\int m''$) and normal mixture errors. For example,

where $z_1$ and $z_2$ are the longitude and latitude of the point z.
The functions $m_j(z)$ were scaled to have the same range as the
smoothed values of a key survey variable, and the errors were
scaled to have the same range as the residuals of the same
variable after smoothing. Large variances were generated at
locations with high residuals, and small variances at locations
with low residuals. The simulation results based on the
smooth function are given in Table 5.

Using MSE as the criterion for assessing methodology, the
results were not consistent for the three functions $m_j(z)$.
However, when either RB² or SB² was used as the performance
measure, the LL estimator with k-nearest-neighbour
bandwidth selected using SF outperformed the other methods
by at least ten percent for each function $m_j(z)$, and is
therefore the currently preferred methodology for producing
ABARE’s maps.


Table 5
Comparison of locally-linear (LL) and Nadaraya-Watson (NW)
weight sequences, using fixed (FBW) and k-nearest-neighbour (NN)
bandwidths selected by least-squares cross-validation (CV)
and the criterion detailed below (SF). The results were obtained
from 400 independent samples with mean function
and normal mixture errors. The MSE values were
calculated using the average over the finite population
of $(y - m(z))^2$

                MSE × 10⁻⁷               RB² × 10⁻⁷               SB² × 10⁻⁷
                CV          SF           CV          SF           CV          SF
LL   FBW        39.64       93.93        4.44        1.67         [illegible] [illegible]
     NN         20.50       22.83        [illegible] [illegible]  0.37        0.14
NW   FBW        41.91       52.78        3.29        [illegible]  [illegible] [illegible]
     NN         [illegible] [illegible]  3.03        2.33         0.62        0.41


The bandwidth selection method aimed at reducing the 
speckledness of a map (SF) is a measure of the smoothness of 
the map: it measures how similar the smoothed value is at any 
farm to that of its neighbours. Let p(i) be the survey estimate 
of the percentile of the smoothed variable at the i-th farm. Let 
$S_i$ be the set of indices of the six farms closest to the i-th
farm. In this method, the value of

$$SF(h) = (6n)^{-1} \sum_{i=1}^{n} \sum_{k \in S_i} |p(i) - p(k)|$$

is calculated. It is scale-free, and decreases monotonically as
the bandwidth increases. The chosen bandwidth is the
smallest bandwidth with a sufficiently small (< ε) rate of
decrease of SF. The value of ε was chosen subjectively
following detailed examination of maps of five key variables
for five values of ε.
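
The selection rule can be sketched as follows. The sketch uses unweighted percentile ranks in place of the survey estimates of the percentiles, planar distances between farm locations, and a finite-difference rate of decrease over a grid of candidate bandwidths; all three choices, and the function names, are assumptions made for illustration.

```python
import math


def percentile_ranks(values):
    """Unweighted percentile rank (0 to 100) of each value."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for r, i in enumerate(order):
        ranks[i] = 100.0 * (r + 0.5) / n
    return ranks


def nearest_neighbours(farms, m=6):
    """For each farm, the indices of its m closest other farms."""
    sets = []
    for i, f in enumerate(farms):
        others = [j for j in range(len(farms)) if j != i]
        others.sort(key=lambda j: math.dist(f, farms[j]))
        sets.append(others[:m])
    return sets


def speckle_factor(smoothed, neighbour_sets, m=6):
    """SF: mean absolute percentile difference between a farm and its neighbours."""
    p = percentile_ranks(smoothed)
    n = len(smoothed)
    total = sum(abs(p[i] - p[k]) for i in range(n) for k in neighbour_sets[i])
    return total / (m * n)


def choose_bandwidth(bandwidths, sf_values, eps):
    """Smallest bandwidth whose rate of decrease of SF (per unit bandwidth,
    measured to the next larger candidate) is below eps."""
    for j in range(len(bandwidths) - 1):
        rate = (sf_values[j] - sf_values[j + 1]) / (bandwidths[j + 1] - bandwidths[j])
        if rate < eps:
            return bandwidths[j]
    return bandwidths[-1]
```

In use, the data would be smoothed once for each candidate bandwidth (in increasing order), `speckle_factor` evaluated on each smoothed map, and the resulting values passed to `choose_bandwidth`.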
Using Data on Interruptions in Telephone Service
as Coverage Adjustments


J. MICHAEL BRICK, JOSEPH WAKSBERG and SCOTT KEETER¹


ABSTRACT 


Telephone surveys in the U.S. are subject to coverage bias because about 6 percent of all households do not have a 
telephone at any particular point in time. The bias resulting from this undercoverage can be important since those who do 
not have a telephone are generally poorer and have other characteristics that differ from the telephone population. 
Poststratification and the other usual methods of adjustment often do not fully compensate for this bias. This research 
examines a procedure for adjusting the survey estimates based on the observation that some households have a telephone 
for only part of the year, often due to economic circumstances. By collecting data on interruptions in telephone service in 
the past year, statistical adjustments of the estimates can be made which may reduce the bias in the estimates but which at 
the same time increase variances because of greater variability in weights. This paper considers a method of adjustment 
using data collected from a national telephone survey. Estimates of the reductions in bias and the effect on the mean square 
error of the estimates are computed for a variety of statistics. The results show that when the estimates from the survey are 
highly related to economic conditions the telephone interruption adjustment procedure can improve the mean square error
of the estimates.


KEY WORDS: Coverage; Bias; Weighting adjustment; Telephone sampling; RDD surveys. 


1. INTRODUCTION 


Telephone surveys provide a relatively economical method 
of data collection compared with face-to-face interviewing. 
However, telephone surveys in the U.S. are subject to an 
important source of bias that does not affect household 
surveys conducted with face-to-face interviewing: at present 
only 94 percent of households nationally have telephone 
service at any given time. Moreover, for some populations 
such as households with young children, coverage rates are 
even lower. 

Weighting that includes poststratification based on demo- 
graphic variables known to be associated with telephone 
coverage is effective in mitigating some of the consequences 
of coverage bias in telephone surveys. Postsurvey weighting 
is also generally used to compensate for nonresponse and 
other biases. But even when effective, weighting to known 
demographic totals only partially solves the problem of cover- 
age bias, undercompensating for some variables (Massey and 
Botman 1988) and overcompensating for others (Brick, 
Burke, and West 1992). 

This article describes a study of an alternative method for 
adjusting telephone survey data to compensate for coverage 
bias. The method, suggested by Keeter (1995), is based on the 
observation that telephone subscription is a dynamic condi- 
tion not just across households in the population, but also 
within many households over time. A sizable number of U.S. 
households lose and gain telephone status during a given year. 
Because of this phenomenon, the telephone population at a 
given time includes households that have recently been in the
nontelephone population. Despite considerable information
on the size and characteristics of the nontelephone population, 
little is known about its dynamics over shorter time periods. 
Evidence from social workers, telephone companies, and 
others who deal with indigent households suggests that for 
many families, telephone subscription is episodic. House- 
holds may have a telephone when they can afford it, but the 
telephone may be turned off when times are harder, or when 
the bills get too large to manage (Federal Communications
Commission 1988). It is not known how many households 
change their telephone status and how long they stay in a 
particular status. 

Keeter (1995) examined two household panel surveys 
to obtain estimates of the dynamics of telephone service 
subscription. Those households that changed telephone status 
(presence of a telephone in the household) are called 
‘transient’ households. For data from one panel survey that 
collected data 12 months apart, half of the 6 percent of all 
households without a telephone at either time were transient. 
For the other panel survey in which data were collected only 
two months apart, one-fourth of the 6 percent of households 
without telephones at either point in time were transient. 
Since these estimates were based on observations at two 
points in time rather than continuous measurement, they 
underestimate the percent of households that are transient. 
Nevertheless, these results show that a substantial proportion 
of households without a telephone at a specific point in time 
is transient. 

Another important condition that must be satisfied if the
transient telephone households are to be useful in reducing
coverage bias involves the characteristics of transient
households and nontelephone households. If the two groups are
not similar, then the adjustments will not be effective. Using
the panel data and data from several Virginia surveys, Keeter
(1995) showed that the characteristics of the transient
households are much more consistent with nontelephone
households than telephone households.

¹ J. Michael Brick and Joseph Waksberg, Westat, Inc., 1650 Research Blvd., Rockville, MD 20850, U.S.A.; Scott Keeter, Virginia Commonwealth University,

These findings suggest the possibility that weighting 
adjustments that use the data from households that have tele- 
phones only sometimes during the year might be an improve- 
ment over the current practice. To evaluate this approach to 
adjusting the weights, questions were added to two national 
surveys conducted in 1993 by Westat. Both of these surveys 
were random digit dial (RDD) and computer assisted 
telephone surveys, and the data were collected in the tele- 
phone research centers of Westat. 

One of the surveys is the National Household Education 
Survey of 1993 (NHES:93). The NHES:93 was conducted for 
the National Center for Education Statistics of the Depart- 
ment of Education in the spring of 1993 to study issues 
related to school readiness of young children and school 
safety and discipline of children in school. The other survey 
was the National Survey of Veterans (NSV) which was 
conducted in the second half of 1993 for the U.S. Department 
of Veterans Affairs. In this survey, adults were screened to 
determine if they were veterans, and the veterans were then 
asked about a variety of topics including their health, educa- 
tion, and financial status. 

Below, we present estimates of the percentage of persons 
that experienced some interruption of telephone service, 
describe procedures for adjusting the survey weights using 
these data, and discuss the statistical implications of using the 
adjusted weights. The final section summarizes the findings 
and gives some considerations for using this technique in 
RDD telephone surveys. 


2. ESTIMATES OF INTERRUPTIONS 
OF TELEPHONE SERVICE 


Estimates of the percentage of persons with interruptions 
of telephone service from national surveys were needed to 
further examine the potential of reducing coverage biases 
using these data. Questions were added to the NSV and the 
NHES:93 for this purpose. In the NSV, about 23,000 house- 
holds were screened and interviews were completed with over 
5,500 eligible veterans. In the screening interview, all house- 
hold members 14 years and over were enumerated and 
questions were asked about their characteristics and their 
veteran status. If a sampled adult was a veteran, then a more 
detailed interview was attempted. The results reported here
are based on the items asked about the adults enumerated in
the screening interview, which covered only a few
characteristics of the adults and the household.

In the NHES:93, 64,000 households were screened and 
nearly 30,000 interviews were conducted within those 
screened households. Two survey components were included:
School Readiness (SR) and School Safety and Discipline
(SS&D). Approximately 11,000 parents of 3- to 7-year-olds 
completed interviews on SR topics and about 12,700 parents 
of children in grades 3 through 12 were interviewed for the 
SS&D component. Data on interruptions in telephone service 
were collected from households in which at least one SR or 
SS&D interview was completed. 

Since the responses to the questions in the NHES:93 were 
only obtained for those households that completed either an 
SR or SS&D interview, many characteristics of the children 
can be analyzed, but the data do not apply to as broad a 
population as the NSV. The NSV applies to all adults, but 
only limited data were collected on most of the adults. For all 
households that had completed an interview (a screening 
interview in the NSV and a more detailed interview in the 
NHES:93), a member of the household was asked if the 
household had experienced an interruption in telephone 
service in the last 12 months and how long it lasted. 


Estimated Service Interruptions in the NSV and 
NHES:93 


The estimated percentage of persons in households that 
had a telephone interruption of one day or more during the 
last 12 months varies substantially from survey to survey. 
Only 2.3 percent of adults had an interruption of one day or 
more based on the data from the NSV, while the percentage 
from the NHES:93 for younger children (the SR population of 
3- to 7-year-olds) was 12.0 percent, and for the SS&D popu- 
lation of older children (grade 3 through 12) it was 9.2 percent. 

Figure 1 shows estimates and 95 percent confidence 
intervals of the percentage of persons that had interruptions 
of one day or more along with estimates for those with 
interruptions of telephone service that lasted for at least one 
week and at least 4 weeks. While the percentages vary 
from sample to sample, the patterns of increase by length of 
interruption are relatively stable. The percentage with inter- 
ruptions of one week or longer is less than half the percentage 
with any interruption, and the percentage with interruptions 
of 4 weeks or more is about one-fourth the percentage with 
any interruption.

Figure 1. Estimated percentage of persons with interrupted
telephone service from the three populations

The large difference in the estimates from the NSV and the
NHES:93 comes from at least two important sources. The 
first source is that the populations were different. We would 
expect young children to live in households that experience 
more interruptions than older children and adults. Thornberry 
and Massey (1988) estimated that the telephone coverage rate 
for young children was lower than for any other age group. 
Thus, the difference of about 3 percentage points in the estimates of the
percentage with an interruption between the younger (SR) and 
older (SS&D) children from the NHES:93 is reasonable. 

The difference in the populations does not completely 
account for the large difference between the NSV and the 
NHES:93 estimates. An important reason for this difference 
is related to the way the questions were asked in the two 
surveys. The NHES:93 interview began by asking, “During
the past 12 months, has your household ever been without
telephone service for more than 24 hours?” In the NSV
interview, respondents were asked, “At any time during the past
12 months, has your household not had telephone service?”
This was followed by a question that asked if the interruption 
was for at least 24 hours. Thus, the NSV version was a 
screening item followed by a more detailed question. This 
type of construction often depresses reports of subsequent 
activities, which is consistent with the lower NSV estimates. 

A more important reason for the difference is probably
the wording of the questions. With the NSV question, a
‘no’ response may have confused respondents because the 
question asks if they did not have telephone service. Converse 
and Presser (1986) discuss the problems that arise with this 
type of question construction. The wording for the NHES:93 
is less confusing. The combination of the wording and the use 
of a screening item in the NSV is likely to be the main reason 
for the smaller estimate using the NSV questionnaire. 

The difference in the estimates associated with the 
different ways of asking the interruption questions is evident 
from the estimates from two surveys conducted in Virginia by 
Virginia Commonwealth University. In a November 1993 survey,
the items about telephone interruptions were asked using the 
NSV wording; in April 1994 the items were changed to the 
NHES:93 wording. The results from the surveys parallel the 
differences in the estimates between the NSV and the 
NHES:93. The November 1993 Virginia study estimated that 
3 percent had an interruption in service in the last 12 months, 
while in April the estimated percentage was 9 percent. Thus, 
it is clear that the different ways of asking the questions 
heavily influenced the size of the estimates, and it suggests 
that the estimates from the NSV are biased downward. Some 
adults who did experience an interruption in telephone service 
during the previous 12 months probably responded incorrectly 
in the NSV. 


Characteristics of Persons With Service Interruptions 


Estimates of the percentage of persons who had a tele- 
phone interruption are examined below by the characteristics 
of the person to evaluate the potential of using these data to 
adjust for coverage bias. We estimated the percentage of
persons in households with any interruption in service by
characteristics collected in both the NSV and the NHES:93. 
These estimates are shown in the first part of table 1. Some 
differences in the distributions may be due to the different 
ways of asking the questions. For example, the education 
classification is different in the two surveys: in the NSV 
education is recorded for the oldest person in the household, 
while in the NHES:93 education is the highest for either of 
the parents of the child. 

All subsequent analysis is restricted to NHES:93 data for 
two reasons. First, more data on the characteristics are avail- 
able from the NHES:93 detailed SR and SS&D interviews 
than the NSV screening interview. Second, the telephone 
interruption estimate from the NSV is biased due to the 
wording of the item, as discussed earlier. Of course, the 
NHES:93 estimates apply to households with children, which 
have higher nontelephone rates than the general population, 
and in that sense they do not reflect the situation for the total 
population. 

Using the NHES:93 data, we find that the percentages of 
persons with some interruption are relatively consistent for 
the SR and the SS&D populations (see table 1). The 
characteristics generally associated with lower economic status 
have the highest percentage with interruptions. For example, 
the percentage of children with interruptions in both the SR 
and SS&D populations is larger for those from households 
with lower household income than for those from households 
with higher income. Similarly, children participating in public 
assistance programs (WIC or free meals) have much higher 
rates of service interruptions than nonparticipants. However, 
the percentages of children in households with telephone 
interruptions are less variable for characteristics related to 
school readiness and school safety and discipline than for 
the socioeconomic items. Additional characteristics for 
both populations were examined and presented in Brick, 
Keeter, Waksberg and Bell (1996), but are not shown here. 
For most of the other substantive items, the differences in the 
percentage of persons with some interruption in telephone 
service were either not statistically significant or not large 
enough to be of great practical importance. 


3. WEIGHT ADJUSTMENTS 


In almost all sample surveys, the data collected from 
respondents are adjusted to account for nonresponse and 
noncoverage and to reduce the variability in the estimates by 
using auxiliary data from other data sources. One of the most 
important benefits of this type of adjustment in telephone 
samples is that it often reduces the bias associated with the 
undercoverage of persons living in households without tele- 
phones. 

Kalton and Kasprzyk (1986) discuss adjustments to the 
base weights, classifying the adjustments into four categories: 
population weighting adjustments, sample weighting adjust- 
ments, raking ratio adjustments, and response probability 

Table 1 


Estimated Percentage of Persons With Any Interruptions in Telephone Service in Last 12 Months for Three Populations 

                NSV                      NHES:93 SR               NHES:93 SS&D 
        Estimate  Standard error   Estimate  Standard error   Estimate  Standard error 
Total 28, 0.1 12.0 0.4 92 0.3 
Region 
Midwest DES 0.2 11.0 1.0 flies) O77 
Northeast 2.0 0.2 eS) 2 9.0 0.8 
South 2.6 0.2 13.6 0.7 10.8 0.6 
West 2.4 0.2 eS 0.9 oF 0.8 
Race/ethnicity¹ 
White 2.0 0.1 9.3 0.5 7.2 0.3 
Black 5h8) 0.4 19.8 iS) 14.7 1.1 
Hispanic 3.9 0.5 Te, IS) 14.1 ibil 
Other 2.6 0.6 Miley 2.6 Os 135) 
Education² 
Less than high school diploma 37) 0.2 18.4 1.8 17.4 1.6 
High school graduate 2.0 0.2 15.4 0.8 11.0 0.8 
Some college 23 0.2 11.8 0.7 8.6 0.5 
Bachelor's degree 1.6 0.2 5) 0.8 Se) 0.8 
Graduate school MDP) 0.3 937 0.7 4.5 0.6 
Household income 
$10,000 or less 22.8 1.3 19.0 ee 
$10,001 to $20,000 19.9 1.4 15.7 Jel 
$20,001 to $30,000 9.3 0.8 7.9 0.6 
More than $30,000 5.5 0.5 5.0 0.3 
Women, infant and children program participant³ 
Yes 18.2 1.3 
No 8.0 0.6 
Free meal at school or center⁴ 
Yes Ile 182 
No 7.6 0.5 
Birth weight 
5.5 pounds or less 12.0 1.6 
Greater than 5.5 pounds 12.0 0.4 
School control 
Public 9.4 0.4 
Private ES ia! 
Ease of obtaining marijuana at school⁵ 
Very or fairly easy OF 0.6 
Hard 8.0 0.8 
Nearly impossible 9.0 0.7 


¹ Race/ethnicity is reported for the oldest member in the NSV and for the child in the NHES:93. 

² Education is for the oldest household member in the NSV and the most educated parent of the child in the NHES:93. 

³ Estimate restricted to preschoolers. 

⁴ Estimate applies to children except preschoolers. 

⁵ Estimate applies only to children in grades 6 through 12. 

Source: U.S. Department of Veterans Affairs, National Survey of Veterans, summer/fall 1993, and U.S. Department of Education, National Household 
Education Survey, spring 1993. 

adjustments. In the NHES:93, sample weighting adjustments 
and raking ratio adjustments were used. Sample weighting 
adjustments were used to account for differential nonresponse 
from sampled persons. Raking ratio adjustments were then 
used to make the specified marginal distributions of the 
sample correspond to totals from the October 1992 Current 
Population Survey (CPS). One of the most important benefits 
of the type of raking ratio adjustment used in the NHES:93 is 
that it reduces the bias associated with the undercoverage of 
persons living in households without telephones because the 
CPS covers persons in both telephone and nontelephone 
households. 
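
As a rough illustration of how a raking ratio adjustment works, the following Python sketch repeatedly scales a set of weights until the weighted totals agree with control totals on each margin in turn. The function name `rake`, the two raking dimensions, and all of the numbers are hypothetical; the actual NHES:93 raking dimensions and October 1992 CPS control totals are not reproduced here.

```python
def rake(weights, categories, targets, max_iter=100, tol=1e-8):
    """Rake weights so that weighted totals match the control totals.

    weights:    starting weights, one per respondent
    categories: one dict per respondent, mapping a raking dimension
                to that respondent's category on the dimension
    targets:    dict mapping dimension -> {category: control total}
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for dim, control in targets.items():
            # Current weighted total in each category of this dimension.
            totals = {}
            for wi, cats in zip(w, categories):
                totals[cats[dim]] = totals.get(cats[dim], 0.0) + wi
            # Scale every weight so this margin matches its control total.
            for i, cats in enumerate(categories):
                factor = control[cats[dim]] / totals[cats[dim]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:  # all margins already agree
            break
    return w


# Hypothetical respondents and control totals (illustration only).
respondents = [
    {"region": "South", "tenure": "rent"},
    {"region": "South", "tenure": "own"},
    {"region": "West", "tenure": "rent"},
    {"region": "West", "tenure": "own"},
]
controls = {
    "region": {"South": 60.0, "West": 40.0},
    "tenure": {"rent": 30.0, "own": 70.0},
}
raked = rake([10.0, 10.0, 10.0, 10.0], respondents, controls)
```

Because the control totals come from a source that covers both telephone and nontelephone households, the raked weights sum to totals for the full population rather than for telephone households only.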

The data on telephone service interruptions can be used to 
make a response probability adjustment. Response probability 
adjustments are constructed by assuming that each sampled 
unit has a probability of responding to the survey, estimating 
that probability, and then using the inverse of the estimated 
response probability as a weighting adjustment. The Politz 
and Simmons (1949) method is probably the best known 
application of the response probability adjustment procedure, 
and Kalton and Kasprzyk (1986) discuss others. 

To apply this type of adjustment using the telephone 
service interruption data, assume that living in a telephone 
household is a dynamic phenomenon and that a probability 
distribution can be associated with this status. Conceptually, 
a survey is conducted by sampling from this distribution and 
observing only those members that live in telephone house- 
holds at the time of the survey. The probability of living in a 
telephone household (the equivalent of the response proba- 
bility) must then be estimated for each respondent. The inverse 
of the estimated probability is the coverage adjustment. This 
model assumes that each person can be assigned a probability 
of being in a household with a telephone and that the 
probability is between zero and one (but not equal to zero). 

The data on whether or not a household had an interruption 
in telephone service and the length of that interruption are the 
basis for this type of adjustment. Persons are divided into two 
categories: those in households with interruptions in service 
and those in households without interruptions in service. The 
probability is assumed to be equal to one for persons in 
households without interruptions and their weights are not 
adjusted. The weights of persons in households with at least 
some interruptions in the last 12 months are adjusted to 
account for other households that have a probability of being 
covered of less than one. The adjustments may vary depending 
on the length of time they lived in nontelephone households 
and on other characteristics of the household. The purpose of 
having different adjustments is to account for the fact that 
some persons are more likely to live in nontelephone house- 
holds than others. 

Although the weighting adjustments may reduce the under- 
coverage bias, introducing adjustments also typically increases 
the variances of the estimates. Kish (1992) discusses the 
reasons for unequal weights as well as the consequences of 
using them in a variety of situations. He advocates a common 
statistical approach of balancing the bias reductions against 
the variance increases. If the weights reduce the bias of the 
estimates significantly, then it may be worthwhile accepting 
the variance increases. On the other hand, small reductions 
in bias associated with large variance increases are not 
recommended. 

In the remainder of this section, the specific weighting 
adjustment procedures are described. The statistical properties 
of the weights developed under four alternative adjustment 
schemes are presented. The alternative weights are applied to 
the NHES:93 data and the decrease in the bias of the 
estimates is compared with the increase in the variance of the 
estimates due to the unequal weighting. 


Adjustment Schemes 


The first step was to decide how to classify the length of 
interruption in telephone service. Various lengths of interrup- 
tions were examined to determine cut-offs that discriminated 
between temporary interruptions not due to economic causes 
and other interruptions. It was decided to use two categories 
for forming adjustment cells: interruptions of one week or 
more, and interruptions of one month or more. 

Within each of the length-of-service interruption catego- 
ries, the children were classified into adjustment cells based 
on either parental education or tenure (home ownership). 
Race/ethnicity was used to form cells within the parental 
education and tenure categories. These cells were chosen 
because the percentage of persons with interruptions varied 
by these characteristics and the corresponding data were also 
available from the CPS. Four adjustment schemes were 
defined using these items: 

Scheme A1 - children in households that had a telephone 
service interruption of one week or more within categories 
defined by parental education (less than high school, high 
school diploma, college diploma or above) and race/ethnicity 
(Hispanic, black/non-Hispanic, white and other/non-Hispanic); 

Scheme A2 - children in households that had a telephone 
service interruption of one month or more within categories 
defined by parental education and race/ethnicity; 

Scheme B1 - children in households that had a telephone 
service interruption of one week or more within categories 
defined by tenure (own/other, rent) and race/ethnicity; and 

Scheme B2 - children in households that had a telephone 
service interruption of one month or more within categories 
defined by tenure and race/ethnicity. 


The adjustment factors for these schemes could not be 
obtained directly from the NHES:93 data because no data 
were collected from households without telephones. Instead, 
the adjustments were developed using both CPS and 
NHES:93 data and then applied to the NHES:93 weights. 

To motivate the adjustment of the weights, consider 
partitioning the universe of persons into four components: $t_1$ 
is the number of persons in telephone households with no 
telephone interruptions in the past year; $t_2$ is the number of 
persons in telephone households with some telephone 
interruptions in the past year; $t_3$ is the number of persons in 
nontelephone households with no telephone interruptions in 
the past year (i.e., persons who lived in nontelephone house- 
holds throughout the entire year); and $t_4$ is the number of 
persons in nontelephone households with some telephone 
interruptions in the past year. As noted above, the response 
probability model assumes $t_3 = 0$. 

Using the CPS it is possible to estimate $t_1 + t_2$ and $t_4$ 
(assuming $t_3 = 0$); designate these estimates as $\hat{t}_1 + \hat{t}_2$ and $\hat{t}_4$, 
respectively. From the NHES:93, $t_1$ and $t_2$ can be estimated 
separately; call these estimates $\hat{t}_1^{*}$ and $\hat{t}_2^{*}$, respectively. The 
bias in the NHES:93 estimates arises because they are from a 
telephone survey and do not include persons in nontelephone 
households ($t_4$). 

A weight adjustment of $A = 1 + t_4/t_2$ would result in 
unbiased estimates of totals; however, this adjustment in- 
volves unknown population quantities that must be estimated. 
Since $t_2$ can only be estimated from the NHES:93 and $t_4$ can 
only be estimated from the CPS (assuming $t_3 = 0$), the adjust- 
ment is expressed in ratios to reduce the bias due to estimating 
the totals from different surveys. The revised weight is 


$$ w_i' = w_i \left( 1 + \frac{\hat{t}_4 / (\hat{t}_1 + \hat{t}_2)}{\hat{t}_2^{*} / (\hat{t}_1^{*} + \hat{t}_2^{*})} \right)^{\delta_i} \qquad (1) $$


where $w_i$ is the NHES:93 weight adjusted for nonresponse of 
sampled persons but not yet raked to October 1992 CPS 
totals, and $\delta_i = 1$ if the person lives in a household that had an 
interruption of telephone service in the last year and is zero 
otherwise. The quantity in parentheses in (1) is an estimate of 
$A$, the weight adjustment. 
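
The following Python sketch shows one way the adjustment could be computed and applied within a single cell. It assumes the factor is formed as one plus the ratio of two proportions: the nontelephone total relative to the telephone total (from CPS-type data), divided by the share of telephone-household persons with an interruption (from the telephone survey). The function names and all input values are hypothetical and are not taken from the NHES:93 or the CPS.

```python
def coverage_adjustment(cps_nontel, cps_tel, svy_interrupted, svy_tel):
    """Estimate the factor A for one adjustment cell.

    cps_nontel:      estimated persons in nontelephone households
    cps_tel:         estimated persons in telephone households
    svy_interrupted: telephone-survey estimate of persons with an interruption
    svy_tel:         telephone-survey estimate of all persons
    """
    nontel_ratio = cps_nontel / cps_tel            # nontelephone relative to telephone
    interrupted_share = svy_interrupted / svy_tel  # share with an interruption
    return 1.0 + nontel_ratio / interrupted_share


def adjust_weights(weights, interrupted, factor):
    """Multiply the weight by the factor only where an interruption was reported."""
    return [w * factor if flag else w for w, flag in zip(weights, interrupted)]


# Hypothetical cell: 10 percent nontelephone relative to telephone,
# and 10 percent of telephone-household persons with an interruption.
factor = coverage_adjustment(100.0, 1000.0, 50.0, 500.0)
adjusted = adjust_weights([100.0, 100.0, 100.0], [False, True, False], factor)
```

In this made-up cell the factor is 2, so each person with an interruption also represents one person currently without a telephone, while the weights of persons without interruptions are unchanged.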

Revised weights were computed separately for the SR and 
SS&D components. Rather than the overall adjustment as 
given in (1), the weight adjustments were computed within 
the cells defined for each of the four weighting schemes (A1, 
A2, B1, and B2). Table 2 shows the resulting adjustment 
factors for the SR and SS&D components. The adjustments 
in the first column are those for schemes A1 and B1. The 
second column contains the adjustment factors for schemes 
A2 and B2. The adjustment factors for the schemes based on 
the one month or more interruptions are greater than those 
based on the one week or more because the denominator of 
the ratio is, by definition, smaller for this classification (see 
Figure 1 for estimates of the percentage of persons with 
interruptions for each scheme). 

The last weighting step rakes the four alternative weights 
to the same October 1992 CPS totals used in raking the 
standard NHES:93 person-level weights. The result of this 
process is the standard NHES:93 weight and four alternative 
weights based on different adjustment schemes. All five of 
the weights conform to the same marginal totals. The only 
difference in the weights is the adjustment for the telephone 


Table 2 
Weighting Cell Adjustment Factors, Based on Length of Interruption of Telephone Service 

                                 SR                         SS&D 
Factor                      Length of service interruption 
                  One week     One month     One week     One month 
                  or more      or more       or more      or more 

Cells defined by parental education and race/ethnicity (Schemes A1 and A2) 

Less than high school; Hispanic Dal 16.35 4.89 8.52 

Less than high school; black, non-Hispanic 5.10 6.72 4.26 5.95 

Less than high school; white and other, non-Hispanic 4.98 Shed 3.81 4.86 

High school diploma; Hispanic Bees 2.76 2.67 4.5] 

High school diploma; black, non-Hispanic 2.65 SS 3.06 4.71 

High school diploma; white and other, non-Hispanic 2.16 2.79 2.18 3.09 

College degree or more; Hispanic 1.34 PLES) 1.96 8.22 

College degree or more; black, non-Hispanic NaF 2.64 1.35 8.83 

College degree or more; white and other, non-Hispanic 1.58 2.09 1.91 3.48 
Cells defined by tenure and race/ethnicity (Schemes B1 and B2) 

Renter; Hispanic 3.74 Sls 3.58 6.08 

Renter; black, non-Hispanic 823 4.54 3.38 4.95 

Renter; white and other, non-Hispanic 2.43 2.96 2.99 4.00 

Owner/other; Hispanic 2.00 3.06 2.81 5.66 

Owner/other; black, non-Hispanic D3 3.46 2.90 6.11 

Owner/other; white and other, non-Hispanic 2.26 3.45 2.03 3.10 

service interruption prior to raking. The standard weights are 
not further adjusted while the alternative weights have 
different adjustments depending on the scheme. 


4. FINDINGS 


As noted above, adjustment of the weights to reduce the 
bias increases the variability of the weights, thus increasing 
the variance of the estimates. Kish (1992) gives an 
approximate expression for this increase in variance arising 
from unequal weights. We call this expression for the increase 
in variance due to differential weights the variance inflation 
factor (VIF). The VIF can be written as 


$$ \mathrm{VIF} = 1 + \mathrm{CV}^2(\text{weights}) \qquad (2) $$


where CV is the coefficient of variation of the weights. 
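
A minimal Python sketch of expression (2) follows. It assumes the coefficient of variation is computed with the population (divide by n) variance of the weights; the function name and the example weights are hypothetical.

```python
def variance_inflation_factor(weights):
    """Return 1 + CV^2 of the weights (approximation attributed to Kish)."""
    n = len(weights)
    mean = sum(weights) / n
    # Population variance of the weights (divides by n, an assumption here).
    variance = sum((w - mean) ** 2 for w in weights) / n
    cv_squared = variance / mean ** 2
    return 1.0 + cv_squared


# Hypothetical weights: equal weights give no inflation.
vif_equal = variance_inflation_factor([2.0, 2.0, 2.0, 2.0])
# Unequal weights inflate the variance.
vif_unequal = variance_inflation_factor([1.0, 3.0])
# Ratio of an alternative weight's VIF to a reference weight's VIF.
vif_ratio = vif_unequal / vif_equal
```

The last line mirrors the kind of comparison made when the VIF of an adjusted weight is divided by the VIF of the standard weight.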


Table 3 shows the VIF for the standard NHES:93 weights 
for each component. The SS&D component is broken down 
by the grade of the student, because youth were selected at 
different rates for these grade levels. The VIF for each of the 
components is about 1.4, indicating the variance is inflated by 
about 40 percent due to the variability in the standard weights. 
The VIF for the combined SS&D file is somewhat larger (1.5) 
because it includes youth who were sampled at different rates. 

The other factors given in table 3 are the ratios of the VIF 
for the four alternative weights to the VIF for the standard 
weight. These ratios show how much greater the variances of 
estimates produced using the alternative weights are expected 
to be as compared to the variances of estimates produced 
using the standard NHES:93 weights. 

Overall, the increases in variance due to the telephone inter- 
ruption coverage adjustment range from 9 to 13 percent for 
schemes A1 and B1 in the SS&D component but are up to 
20 percent for the SR component. The ratios are larger for 
schemes A2 and B2, ranging from 24 to 35 percent, with the 
largest ratio for scheme A2 for the SR component. The larger 
ratios (hence VIFs) for the schemes based on interruptions of 
one month or more are a consequence of the larger and more 
variable factors shown in the second column of table 2. The 
ratios for the SR population are higher than the SS&D ratios. 


4.1 Coverage Bias Reduction 


If estimates of the same characteristics as those produced 
from the NHES:93 were available from an independent 
source and these benchmark estimates were free of telephone 
coverage bias, then it would be possible to compare the five 
estimates to the benchmark. However, benchmarks compar- 
able to the estimates from the two components of the 
NHES:93 do not exist and other methods are needed to assess 
the bias-reducing potential of the coverage adjustments. 

Because of the lack of a benchmark, some model assump- 
tions are required to assess the effectiveness of the adjust- 
ments. For this evaluation we assume that the adjustment 
procedures reduce the coverage bias. As a result of this 
assumption, the difference between the standard estimate and 
the adjusted estimate is considered an unbiased estimate of 
the decrease in the coverage bias resulting from using the 
procedures. Clearly, the coverage bias is not completely 
eliminated by any of the adjustment procedures. Even if the 
model were correct, the bias reductions from the data would 
still be subject to sampling error. Despite the problems with 
this assumption, this type of assumption is necessary to obtain 
some idea of the effectiveness of the adjustment. If the 
adjustment eliminates the bias, the mean square errors of the 
adjusted estimates are equal to the variances of the estimates, 
with no contribution from coverage bias. Therefore, the 
model assumption is favorable to the adjusted estimates, 
positing the adjusted estimates to be unbiased. The impact of 
this assumption is discussed critically after evidence of the 
effectiveness of the method is presented. 

The estimate from each scheme can be compared to the 
standard NHES:93 estimate, and the difference between the 
standard estimate and the adjusted estimate is an estimate of 
the reduction in the coverage bias. With four adjusted esti- 
mates, four different estimates of bias reduction are possible. 
The estimated reduction in bias is 

$$ \hat{b}_a = \hat{p}_s - \hat{p}_a \qquad (3) $$

where $\hat{b}_a$ is the estimated bias reduction using adjustment 
scheme $a$ ($a$ = A1, A2, B1, or B2), $\hat{p}_s$ is the estimate of the 
proportion using the standard weight, and $\hat{p}_a$ is the 
estimated proportion using adjustment scheme $a$. 


Table 3 
Ratios of Variance Inflation Factor Due to Coverage Adjustment 


                               Sample    VIF* for     Ratio of scheme's VIF to standard weight's VIF 
Component                      size      standard     Scheme   Scheme   Scheme   Scheme 
                                         weight       A1       A2       B1       B2 
School Readiness 10,888 1.36 1.20 11315) 1.16 1.26 
School Safety and Discipline 
3rd through 5th graders 2,563 137 Laas ies 1.26 
6th through 12th graders 10,117 1.39 20, 1.09 1.24 
3rd through 12th graders 12,680 1.49 1.26 eal 125 


* VIF is the variance inflation factor. It is the coefficient of variation of the weights squared plus one. 
Source: U.S. Department of Education, National Center for Education Statistics, National Household Education Survey, spring 1993. 


192 Brick, Waksberg and Keeter: Using Data on Interruptions in Telephone Service as Coverage Adjustments 


The estimated reductions in bias under each adjustment 
weighting scheme are given in table 4. Estimates for additional 
characteristics are given in Brick et al. (1996). The bias 
reductions in the standard estimate assume each adjustment 
scheme eliminates the coverage bias. 

The bias reduction estimates for most of the items in 
Table 4 are less than one percent and consistent in direction 
across the schemes. Before summarizing the estimates, we 
must account for the fact that the total number of children is 
constant for all the estimates due to the raking of the estimates 
to the CPS totals. The fixed total number of children across 
response categories has two consequences: it creates a nega- 
tive correlation in the estimated reduction in bias across 
response categories; and it gives a false impression of the 
number of independent pieces of information in the tabled 
values. 

The approach taken to address this problem in sum- 
marizing the bias estimates is to delete the estimate for one of 
the response categories for each item. The “no” response cate- 
gory for all items with “yes” and “no” response categories 
was deleted. For other types of variables, the response cate- 
gory with the smallest estimate was deleted. 

Figure 2 presents the absolute value of the reduction in 
bias estimated using scheme A1 for the SR characteristics, 
and figure 3 is the same representation for the SS&D. These 
figures use all the estimates presented in Brick et al. (1996), 
rather than just those shown in table 4. For both components, 
the bias reductions are small. The largest absolute bias reduction is 
1.3 percent for SR and 0.9 percent for SS&D. The mean and 
median of the bias reductions and the absolute values of the 
bias reductions were also computed for each scheme and each 
component. For the SR component, the mean and median of 
the absolute value of the estimated bias reductions for the 
four schemes are between 0.2 and 0.4 percent. For the SS&D, 
the mean and median of the absolute values are between 0.1 
and 0.3. 

Figure 2. Estimated reduction in absolute bias for School 
Readiness characteristics (scheme A1) 

Source: U.S. Department of Education, National Center for Education 
Statistics, National Household Education Survey, spring 1993 


Figure 3. Estimated reduction in absolute value of bias for 
School Safety and Discipline characteristics 
(scheme A1) 


Bias Ratio 


The size of the absolute reduction in bias is not a very 
useful statistical measure of the impact of the bias because it 
does not take the magnitude of the sampling error of the 
estimate into account. Cochran (1977) discusses the impact 
on confidence intervals as the ratio of the bias to the sampling 
error varies. For each scheme the bias ratio is given by 


$$ \frac{\hat{b}_a}{\mathrm{se}(\hat{p}_s)} \qquad (4) $$


with the standard error of the standard estimate as the 
denominator. As the bias ratio increases in absolute value, the 
chance that a confidence interval covers the population value 
falls further below the nominal confidence level. 

The bias ratios for selected characteristics are shown in 
Table 4. Many of the bias ratios for the SR items are large, 
even though the average and median ratios are near zero. 
Nearly half of the ratios for all the items examined are larger 
than 0.4 in absolute value. A ratio of 0.4 is large enough to 
reduce the coverage of a nominal 95 percent confidence 
interval to about 93 percent. For the SS&D items, the bias ratios are smaller, 
with only 15 percent of all the items having bias ratios greater 
than 0.4. 
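
The effect of a given bias ratio on confidence interval coverage can be checked with a short calculation. The Python sketch below assumes the estimate is normally distributed with a known standard error, so that the standardized error has mean equal to the bias ratio and unit variance; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def actual_coverage(bias_ratio, z=1.96):
    """Coverage of the interval estimate +/- z * se when bias / se = bias_ratio."""
    return normal_cdf(z - bias_ratio) - normal_cdf(-z - bias_ratio)


coverage_unbiased = actual_coverage(0.0)   # close to the nominal 0.95
coverage_ratio_04 = actual_coverage(0.4)   # roughly 0.93
```

Under these assumptions the coverage depends only on the absolute value of the bias ratio, which is why the ratios are summarized in absolute value.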


4.2 Mean Square Error 


Since the variance is not an adequate measure of error for 
biased estimates, the mean square error of the estimates is 
used instead. The mean square error (MSE) is the sum of the 
variance and the square of the bias of the estimate. 

The MSE can be estimated for the NHES:93 estimates 
by using the standard variance estimates and the bias reduc- 
tion estimates presented above. The estimated MSE can be 
approximated as 


$$ \mathrm{MSE}_s = \mathrm{var}(\hat{p}_s) + \hat{b}_a^2 \qquad (5) $$

Table 4 
Estimated Reduction in Bias and Bias Ratio for Selected Characteristics of the NHES:93 

                    Standard estimate       Estimated reduction in bias          Bias ratio 
Characteristic      Estimate   Standard     Scheme Scheme Scheme Scheme    Scheme Scheme Scheme Scheme 
                               error        A1     A2     B1     B2        A1     A2     B1     B2 
School Readiness (SR) population 
Parental educational level 
Less than high school graduate 8.6 0.3 =A leg) =A 0.1 0.1 =i) =. 0.3 0.3 
High school graduate or equivalent 33.9 0.8 0.4 0.3 -0.7 =i 0.5 0.4 -0.9 Sale 
Some college 57.5 0.7 1.3 1.6 0.6 0.9 1.9 Py8) 0.9 1.3 
Mother’s employment status 
No mother in household 2.4 0.2 -0.1 -0.1 -0.1 -0.1 = (0s) = Ord 0S) =0'5 
Employed 35 hours/week or more 34.3 0.5 0.5 0.8 0.2 0.5 1.0 1.6 0.4 1.0 
Employed less than 35 hours/week 20.9 0.5 -0.1 =O) 0.0 20 -0.2 -0.4 0.0 -0.4 
Seeking employment 6.6 0.4 0.0 Ont -0.1 -0.1 0.0 -0.3 -0.3 -0.3 
Not in labor force 35.8 0.6 -0.4 =(0.3) 0.0 0.0 -0.7 -0.5 0.0 0.0 
Father’s employment status 
No father in household 26.3 0.5 -0.4 -0.6 0.0 -0.1 -0.8 =e 0.0 =O 
Employed 35 hours/week or more 63.4 0.6 0.3 0.5 0.1 0.2 0.5 0.8 0.2 0.3 
Employed less than 35 hours/week 3.8 0.3 0.0 -0.1 0.0 0.1 0.0 50:3 0.0 0.3 
Seeking employment 32 0.3 0.0 0.0 -0.1 =\))74 0.0 0.0 -0.3 = O77, 
Not in labor force 3:3 0.2 0.1 0.2 0.0 0.1 0.5 1.0 0.0 0.5 
Time since doctor visit for routine care 
Less than 1 year 84.1 0.4 0.4 0.4 0.2 0.1 1.0 1.0 0.5 0.2 
Over 1 year 15.9 0.4 -0.4 -0.5 -0.2 EOxl =e = ihs3} =(055) = 
Birth weight 
5.5 pounds or less 93:3 0.3 ={0il 0.0 0.0 0.1 =0'3 0.0 0.0 0.3 
Greater than 5.5 pounds 6.7 0.3 0.1 0.0 0.0 -0.1 0.3 0.0 0.0 -0.3 
Child attending center-based program’ 
Yes 52.6 0.8 0.9 0.3 0.8 0.6 1 0.4 1.0 0.8 
No 47.4 0.8 -0.9 -0.3 -0.8 -0.6 =a! -0.4 ale) -0.8 
Child ever attended center-based program’ 
Yes 62.9 0.8 0.5 0.3 0.4 0.3 0.6 0.4 0.5 0.4 
No 37.1 0.8 =) -0.3 -0.4 =O) -0.6 -0.4 -0.5 -0.4 
Attended center-based program prior to school? 
Yes Wek) 0.5 0.6 0.7 0.5 0.6 iV 1.4 1.0 ile) 
No 26.5 0.5 -0.6 =D 7) -0.5 -0.6 =i -14 180) =ile72 
Women, Infant, and Children program participant’ 
Yes 33.8 1.0 -0.6 =! -0.8 =07/ -0.6 -0.1 -0.8 =(0E7/ 
No 66.2 1.0 0.6 0.1 0.8 0.7 0.6 0.1 0.8 0.7 
Free meal at school or center 
Yes 35.8 0.6 -0.9 Sale -0.5 -0.5 Sih -1.8 -0.8 -0.8 
No 64.2 0.6 0.9 il 0.5 0.5 ES 1.8 0.8 0.8 
Repeated kindergarten? 
Yes Sud 0.4 -0.3 SOE) = =()2) -0.8 =13 -0.5 -0.5 
No 94.3 0.4 0.3 0.5 0.2 0.2 0.7 1.3 0.5 0.5 
School Safety and Discipline (SS&D) population 
Parental educational level 
Less than high school graduate 9.4 0.5 =i Tes = O'S) -0.6 -2.4 -2.6 -0.6 = 192 
High school graduate or equivalent S24) 0.6 0.3 0.0 -0.2 -0.6 0.5 0.0 -0.3 -1.0 
Some college 57.9) 0.5 0.9 3 0.5 1.1 1.8 2.6 1.0 2x2) 
Mother's employment status 
No mother in household 35 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
Employed 35 hours/week or more 46.2 0.5 0.0 0.1 -0.1 0.1 0.0 0.2 =02 0.2 
Employed less than 35 hours/week 20.3 0.5 0.1 0.0 0.0 -0.1 0.2 0.0 0.0 -0.2 
Seeking employment 4.5 0.3 (0) Oz = -0.2 =OF7 =O -0.7 Ob 
Not in labor force 25.5 0.5 0.0 0.1 0.2 0.2 0.0 0.2 0.4 0.4 

Table 4 
Estimated Reduction in Bias and Bias Ratio for Selected Characteristics of the NHES:93 - Concluded 

                    Standard estimate       Estimated reduction in bias          Bias ratio 
Characteristic      Estimate   Standard     Scheme Scheme Scheme Scheme    Scheme Scheme Scheme Scheme 
                               error        A1     A2     B1     B2        A1     A2     B1     B2 

Father's employment status 

No father in household 26.8 0.6 =) -0.2 = (51 =(0)77 =-03 -0.3 -0.2 =(0)3) 

Employed 35 hours/week or more 63.2 0.5 0.6 0.9 0.6 0.8 12 1.8 92 1.6 

Employed less than 35 hours/week Shai O72 Oe AO 02 -0.2 SEO =Jl{0 =A) = 1140) 

Seeking employment 2.6 0.2 =02 -0.3 =O =(0}3) -1.0 =J 3) -1.0 =15 

Not in labor force 43 0.3 -0.1 -0.1 (0), =(0),1I (023) =(:3 =) -0.3 
School control 

Public O12 0.3 =O:i -0.1 =I -0.1 -0.3 =3' 053 -0.3 

Private 8.8 0.3 0.1 0.1 0.1 0.1 0.3 0.3 0.3 0.3 
Visitors required to sign in at school 

Yes DE 0.5 0.1 0.4 0.0 0.2 0.2 0.8 0.0 0.4 

No 20.1 0.5 Oe -0.4 0.0 =O -0.2 -0.8 0.0 -0.4 
Had drug or alcohol ed program this year 

Yes 68.5 0.7 0.6 0.8 0.7 0.9 0.9 ited 1.0 Iles 

No 31.5 0.7 -0.6 -0.8 =Or7 OL =) ler =J0 =1k3 
Students in fighting gangs at school* 

Yes 22.3 0.5 = (ks -0.4 {03} = O'S -0.6 -0.8 -0.6 =s1).0) 

No Wed 0.5 0.3 0.4 0.3 0.5 0.6 0.8 0.6 1.0 
Ease of obtaining marijuana at school* 

Very or fairly easy 392 0.6 O02 =(0).3) -0.2 =(0)3) =O) = O35 = (0) 3) -0.5 

Hard 2 OMT 0.5 0.1 0.1 0.2 0.2 0.2 0.2 0.4 0.4 

Nearly impossible 31.1 0.6 0.1 0.1 0.0 0.1 0.2 0.2 0.0 0.2 
Fear of incident of crime at school 

None 66.1 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 

Fear of theft or robbery” 11.9 0.5 (0)! =O 0.0 =)? = -0.4 0.0 -0.4 

Fear of bullying or assault® 8.6 0.3 =(),Il = Onl -0.1 =) ANS, =(0);3) =e! -0.3 

Fear of two or more types of incidents* 13.3 0.5 0.1 0.3 0.1 0.2 0.2 0.6 0.2 0.4 
Knowledge of crime at school 

None 38.7 0.6 0.2 0.1 0.2 0.1 0.3 0.2 0.3 0.2 

Fear of theft or robbery” 14.1 0.5 0.2 0.3 0.2 0.3 0.4 0.6 0.4 0.6 

Fear of bullying or assault® 15.6 0.4 =(0)5) -0.4 -0.4 -0.4 SANS) = 100) =10 NN 

Fear of two or more types of incidents” 31.6 0.6 0.1 0.0 0.0 0.0 0.2 0.0 0.0 0.0 
Victimization by crime 

Not victimized 73.0 0.5 0.3 0.2 0.3 0.2 0.6 0.4 0.6 0.4 

Victim of theft or robbery® 10.9 0.3 = 0:2 -0.1 =O. 0.0 -0.7 = (53) (0):3) 0.0 

Victim of bulling or assault° 8.9 0.3 -0.1 0.0 -0.2 -0.1 S055) 0.0 =\0)57/ SA) KS) 

Victim of two or more types of incidents° Tez. 0.3 0.0 0.0 0.0 -0.1 0.0 0.0 0.0 =O} 
Witnessed crime at school 

None 63.8 0.8 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 

Witnessed robbery*® 0.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 

Witnessed bulling or assault® 24.1 0.8 -0.3 =O)! =) -0.3 -0.4 -0.4 -0.4 -0.4 

Witnessed two or more types of incidents 11.4 0.4 0.0 0.1 0.0 0.0 0.0 0.2 0.0 0.0 


¹ Applies to preschoolers only. 

² Applies to all children except preschoolers. 

³ Applies to children in primary school only. 

⁴ Applies to students in grades 6 through 12 only. 

⁵ For the fear of incident, knowledge of crime, and victimized by crime variables, the second response category is used if either theft or robbery was reported 
but not both, the third response category is used if either bullying or assault was reported but not both. 

⁶ This response category is used if either bullying or assault, but not both, was reported. 

Note: Percents may not add to 100 because of rounding. 

Source: U.S. Department of Education, National Center for Education Statistics, National Household Education Survey, spring 1993. 


Survey Methodology, December 1996 


where $\hat{p}_s$ is the estimated proportion under the standard 
approach and $\hat{b}_a$ is the reduction in bias under scheme $a$. 
Because of the high correlation in the estimates of the bias 
from the four adjustment schemes, only the mean square 
errors for scheme A1 were computed. In Brick et al. (1996), 
the estimates using other schemes are shown to have 
negligible effects. 

The mean square errors of the adjusted estimates are now 
contrasted with the variability in the standard NHES:93 
estimates. The variance increase from adjusting the weights 
using the telephone service interruption data was expressed as 
a VIF in table 3. Multiplying the variance estimates of the 
standard estimates by the appropriate adjustment factor yields 
an approximate variance for the adjusted (presumably 
unbiased) estimates, which is then compared to the mean 
square error of the standard estimates. 

To aid in comparing the weighting procedures, ratios of the 
variance of the adjusted estimate to the mean square error for 
the standard estimate were tabulated (see Brick et al. 1996). 
The ratio is called the mean square ratio and can be written as 


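$$ \mathrm{msr}_a(\hat{p}) = \frac{100 \times \mathrm{relativeVIF}_a \times \mathrm{var}(\hat{p}_s)}{\mathrm{MSE}_s} \qquad (6) $$

where $\mathrm{relativeVIF}_a$ is the ratio of the VIF for scheme $a$ to the VIF for the standard weight (table 3). 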

Note that the mean square error is derived using the bias 
estimated from scheme A1 only, but it is used to compute the 
mean square ratios for all four schemes. As noted above, this 
simplification does not have much effect on the mean square 
ratios because the bias estimates are approximately the same 
across schemes. 

The mean square ratios include contributions from the bias 
(in the mean square error estimates) and the variance (in the 
VIF). When the mean square ratio is 100, the variance of the 
adjusted estimate is exactly equal to the mean square error of 
the biased, standard estimate. A ratio less than 100 indicates 
that the bias reduction of the adjustment is greater than the 
variance increase that comes with it. A mean square ratio over 
100 means that the variance increase associated with the 
adjustment is greater than the bias reduction. 
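
As a rough illustration of how the mean square ratio in (6) behaves, the sketch below computes it from a relative VIF, a variance, and a bias. It assumes that the mean square error of the standard estimate is its variance plus the squared bias; the function name and the numeric inputs are hypothetical and are not NHES:93 values.

```python
def mean_square_ratio(relative_vif, var_std, bias_std):
    """100 * relative VIF * var / mse, with mse = var + bias**2 (assumed form)."""
    mse_std = var_std + bias_std ** 2
    return 100.0 * relative_vif * var_std / mse_std


# Hypothetical inputs: a 20 percent variance increase from the adjustment.
no_bias = mean_square_ratio(relative_vif=1.2, var_std=4.0, bias_std=0.0)
large_bias = mean_square_ratio(relative_vif=1.2, var_std=1.0, bias_std=1.0)
print(round(no_bias, 1), round(large_bias, 1))
```

With no bias in the standard estimate the ratio equals 100 times the relative VIF, so it exceeds 100; with a bias as large as the standard error the ratio falls well below 100.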

Figures 4 and 5 graphically present the msr for the two 
component surveys using scheme A1. In addition, Table 5 
shows summary statistics for the msr for all four adjustment 
schemes. The distributions of mean square ratios for both 
components are very similar with the mean square ratios 
slightly lower for the SR component. The medians for 
schemes A1 and B1 (those based on interruptions of one 
week or more) are near the break-even point of 100. The 
means for these schemes are close to 90, and the figures 
confirm that the difference between the means and medians is 
due to the skewed distributions of the mean square ratios. 

A striking feature of the distributions of the mean square 
ratios for schemes A1 and B1 is the size of the ratios at the 
extremes of the distribution. The maximum mean square ratio 
for both components is 120, while some ratios are as small 
as 26. This means the maximum increase in the mean square 
error of the estimates is 20 percent, while the reductions in 
mean square error for a number of other estimates are quite 
large. Thus, the penalty associated with adjusting even when 
the estimate is not biased is modest, but the benefits of 
adjusting when it is needed are impressive. 

Figure 4. Estimated mean square ratios for selected School 
Readiness items (scheme A1) 

Figure 5. Estimated mean square ratios for selected School 
Safety and Discipline items (scheme A1) 

The distributions for the mean square ratios for schemes 
A1 and B1 are very similar, and the choice of which of these 
schemes should be used may be determined by nonstatistical 
issues, such as availability of data and the other types of 
adjustments required in the survey. The mean square ratios 
show that, for about half the estimates considered, the 
adjusted weights reduce the mean square error below that 
derived from the standard weights. The distributions of the 
mean square ratios for schemes A2 and B2 (those based on 
interruptions of 1 month or more) have medians and means 
that are greater than 100. Essentially, these mean square ratios 
are shifted upward when compared with those of schemes A1 
and B1, and are not recommended. 

Table 5 
Summaries of Distribution of Mean Square Ratios for Selected 
Characteristics of School Readiness and School 
Safety and Discipline Components 

                                    Adjustment scheme 
                             A1         A2         B1         B2 
School Readiness 
  Mean                      89.8      101.0       86.8       94.2 
  Median                    96.0      108.0       92.8      100.8 
  Minimum            [illegible] [illegible] [illegible] [illegible] 
  Maximum                  120.0      135.0      116.0      126.0 
School Safety and Discipline 
  Mean               [illegible]      104.9       92.2 [illegible] 
  Median                   100.8      113.4       99.9      112.5 
  Minimum            [illegible] [illegible] [illegible] [illegible] 
  Maximum            [illegible] [illegible] [illegible] [illegible] 

Source: U.S. Department of Education, National Center for Education 
Statistics, National Household Education Survey, spring 1993. 


5. CONCLUSIONS 


If the percentage of the target population living in non- 
telephone households is relatively large and the characteristics 
of those persons are different from those who live in 
telephone households, then the estimates may be susceptible 
to significant coverage bias. One method of addressing this 
problem without resorting to other modes of data collection 
is to adjust the weights to reduce the coverage bias. In this 
study, the weights for persons in households reporting an 
interruption in telephone service were increased to account 
for those without telephones. 

The bias reduction estimates computed under the assumed 
model showed that the coverage adjustments for the SR 
component improved some of the estimates substantially, and 
did not do much harm to any statistics. The bias reduction 
estimates from the SS&D component, on the other hand, were 
not as substantively important. The adjustments reduced bias 
for both components, but they also increased the variability of 
the estimates. The distributions of the mean square ratios 
show that about half the estimates could be improved using 
the telephone service interruption adjustments. Furthermore, 
even for those estimates that were less accurate due to the 
variance increases associated with the differential weights, the 
magnitude of the increases was not large. In other words, the 
penalty for adjusting when it did not reduce the coverage bias 
was not very great. These findings suggest that the 
adjustments should be seriously considered. 

The alternative weighting schemes performed differently 
with respect to the mean square ratios. The schemes based on 
interruptions of telephone service of one week or more were 
better than the schemes based on interruptions of one month 
or more. The bias adjustments resulting from using 
educational attainment by race/ethnicity categories were 
roughly equivalent to those using tenure by race/ethnicity. 

The size of the sample is a relevant factor that should be 
taken into account when evaluating the use of the telephone 
service interruption adjustment. Bias ratios increase with the 
sample size because the bias is not affected while the sampling 
error of the estimate (the denominator of the bias ratio) 
decreases. Thus, the adjustments should be more beneficial in 
surveys with large sample sizes where the bias ratios might be 
expected to be large. 
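
The dependence of the bias ratio on sample size can be made concrete with a small sketch. It assumes, purely for illustration, a standard error of the simple form `unit_sd / sqrt(n)`; the function name and the numbers are hypothetical.

```python
import math


def bias_ratio(bias, unit_sd, n):
    """Bias divided by the standard error, taking se = unit_sd / sqrt(n)."""
    return bias / (unit_sd / math.sqrt(n))


# The bias is held fixed while the sample size grows.
for n in (100, 400, 1600):
    print(n, round(bias_ratio(bias=0.01, unit_sd=0.5, n=n), 2))
```

Quadrupling the sample size halves the standard error and therefore doubles the bias ratio, which is why a fixed coverage bias matters more in large surveys.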

While the results of this study suggest that the adjustments 
could be useful for many estimates from telephone surveys, 
confirmation is needed before the adjustments are recom- 
mended. As discussed earlier, the estimates of the mean 
square errors in this study were based on the assumption that 
the adjusted estimates eliminated the bias of the estimates. 
This model assumption could not be verified because of the 
lack of benchmark data for comparison. The assumed model 
is very beneficial to the adjusted estimates in the sense that it 
results in lower bounds on the mean square errors for the 
adjusted estimates. Thus, the findings of this study should be 
taken as an indication that adjustment using data on 
interruptions in telephone service is a feasible method, but 
requires further study and evaluation. 

The questions about interruptions in telephone service 
were recently added to the National Health Interview Survey, 
a survey conducted by the Census Bureau for the National 
Center for Health Statistics. The findings from this survey 
should be very useful in evaluating this method because the 
survey covers households without telephones by in-person 
interviews, eliminating the need for the critical model 
assumption used in this study. 

Optimal Sample Redesign Under GREG in Skewed 
Populations With Application 


GURUPDESH S. PANDHER¹ 


ABSTRACT 


Within a survey re-engineering context, the combined methodology developed in the paper addresses the problem of finding 
the minimal sample size for the generalized regression estimator in skewed survey populations (e.g., business, institutional, 
agriculture populations). Three components necessary in identifying an efficient sample redesign strategy involve 
i) constructing an efficient partitioning between the “take-all” and “sampled” groups, ii) identifying an efficient sample 
selection scheme, and iii) finding the minimal sample size required to meet the desired precision constraint(s). A scheme 
named the “Transfer Algorithm” is devised to address the first issue (Pandher 1995) and is integrated with the other two 
components to arrive at a combined iterative procedure that converges to a globally minimal sample size and population 
partitioning under the imposed precision constraint. An equivalence result is obtained allowing the solution to the proposed 
algorithm to be alternatively determined in terms of simple quantities computable directly from the population auxiliary 
data. Results from the application of the proposed sample redesign methodology to the Local Government Survey in Ontario 
are reported. A 52% reduction in the total sample size is achieved for the regression estimator of the total at a minimum 
coefficient of variation of 2%. 


KEY WORDS: Minimal sample size; Optimal sample selection; Precision constraint; Sampled group; Take-all group. 


1. INTRODUCTION 
In many survey situations additional information is 
available on all population units before the survey is 
undertaken. This auxiliary information is frequently useful in 
devising a more efficient sample design and estimation 
strategy. In a survey redesign context, the optimal 
strategy holds the promise of offering the largest reduction in 
survey costs by requiring the lowest sample size necessary to 
meet the desired precision constraint on the estimates. In 
repeat surveys of skewed populations, an efficient sample 
design and estimation strategy may be realized by exploiting 
a) the correlation structure between the size-based auxiliary 
information x (e.g., population of municipality, employees 
in a firm, farm acreage) and the survey variables y (e.g., 
municipality expenditures, value of shipments, farm yield) 
and b) the variance relationship between the survey variable 
and the auxiliary size information. 

In this paper, a comprehensive sample redesign methodology 
is developed for skewed populations with the 
ultimate objective of bringing about maximal reductions in 
the current sample size while ensuring a desired level of 
precision for the generalized regression estimator of the total. 
This work was motivated by the redesign of the Local 
Government Finance Survey (LGFS) conducted by Statistics 
Canada's Public Institutions Division. Financial information 
(e.g., revenues, expenditures, debt, etc.) obtained from local 
government units is used in the estimation and publication of 
financial statistics on a provincial and national basis. 
Although the work presented in this paper is motivated by a 
concrete application, the sample design methodology devised 
applies generally to all surveys based on skewed populations 
(e.g., agricultural, business, and institutional surveys). 

In identifying an efficient new sample design, the overall 
methodology addresses and integrates the solution to three 
problems: 


1) Creation of the “Take-all” and “Sampled” Groups 


Since the variability of the survey response $y_k$ tends to 
increase with the size of the unit $x_k$, it is common in skewed 
populations to sample the largest x-valued units with certainty 
in order to improve the efficiency of the population 
estimators. The demarcation of the population into the 
non-overlapping “take-all” group $U_a = \{1, \ldots, N_a\}$ and 
“sampled” group $U_b = \{1, \ldots, N_b\}$ is obtained through a 
new scheme named the “Transfer Algorithm”. 


2) Choosing an Efficient Sample Selection Scheme 


Let $p(s; \lambda) = (p_a(s_a), p_b(s_b; \lambda))$ represent the complete sample 
design where the sample design parameter $\lambda$ determines the 
type of sample selection implemented in the sampled group $U_b$. 
The sample inclusion probabilities due to $p_b(s_b; \lambda)$ may be 
expressed as $\pi_k(\lambda) = n_b x_k^{\lambda/2} / \sum_{U_b} x_k^{\lambda/2}$, $k \in U_b$. Note that the 
parameter $\lambda$ defines a broad class of sample designs with SRS 
($\lambda = 0$) and pps ($\lambda = 2$) as particular cases. Design optimality 
results (Godambe and Joshi 1965) allow the identification of 
the optimal value for the sample design parameter $\lambda$. 
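
A minimal sketch of the inclusion probabilities for this class of designs follows. It assumes the form $\pi_k(\lambda) = n_b x_k^{\lambda/2} / \sum_{U_b} x_k^{\lambda/2}$, which reproduces SRS at $\lambda = 0$ and pps at $\lambda = 2$; the function name and the toy size values are hypothetical.

```python
def inclusion_probs(x, n_b, lam):
    """pi_k(lam) = n_b * x_k**(lam/2) / sum_j x_j**(lam/2) over the sampled group."""
    weights = [xk ** (lam / 2.0) for xk in x]
    total = sum(weights)
    return [n_b * w / total for w in weights]


x = [1, 4, 9, 16]                    # hypothetical size measures
print(inclusion_probs(x, 2, 0.0))    # SRS: equal probabilities
print(inclusion_probs(x, 2, 2.0))    # pps: proportional to x
```

In the pps case the largest unit receives a value above 1, which is not a valid probability. This is the situation that the take-all group is meant to resolve.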


" Gurupdesh S. Pandher, Survey Analysis and Methods Development Section, Household Survey Methods Division, Methodology Branch, Statistics Canada, 
3) Minimal Sample Size Determination 


The third component of the overall methodology is aimed 
at finding the minimal sample size required to meet the 
imposed precision constraints for the estimator. 

The combined procedure devised integrates these 
components to allow a new globally minimal sample size and 
optimal population partitioning to be determined under a 
flexible range of sample selection strategies (e.g., SRS, pps, 
generalized pps). Firstly, the “Transfer Algorithm” is 
proposed which finds an optimal population allocation 
between the take-all and sampled population groups in the 
sense of minimizing the variance of the generalized 
regression estimator (GREG) of the total. Desirable math- 
ematical properties of this algorithm such as existence and 
optimality of solution along with an equivalence result were 
established in Pandher (1995). The equivalence result allows 
the solution to be determined in terms of simple quantities 
computable directly from the population auxiliary data. 

The Transfer Algorithm is then combined with the sample 
size determination step in an iterative procedure that finds 
the minimal sample size needed to satisfy the imposed precision 
constraints. The combined 
methodology produces a sequence of sample sizes and 
population partitionings which converge to a globally optimal 
solution where further reductions in the sample size are not 
possible given the imposed precision constraint. An 
application of the procedure is given for Ontario using 
provincial data from the Local Government Finance Survey. 

Lavallée and Hidiroglou (1988), Hidiroglou and Srinath 
(1993) (subsequently denoted as L&H and H&S, respec- 
tively), and Glasser (1962) have proposed alternative 
methodologies for constructing the take-all and sampled 
groups within the context of stratified SRS design. The 
proposed approach differs from other methods in three 
respects. Firstly, the population demarcation is obtained under 
a flexible range of sample selection strategies (e.g., SRS, pps, 
generalized pps). Secondly, the criterion for constructing the 
population demarcation is based on minimizing the variance 
of the GREG estimator of the total under the desired sample 
selection strategy (Glasser and L&H base their allocation on 
minimizing the within-stratum sum-of-squares x; H&S use the 
total regression sum-of-squares under a regression model with 
a compulsory intercept assuming SRS). Thirdly, the proposed 
methodology explicitly captures the size-induced hete- 
roscedasticity present in skewed survey populations which 
has been ignored in other frameworks. 

Lastly, it is useful to qualify the sense in which the term 
“optimal” is used. Since the redesign uses auxiliary information 
from a previous cycle of the survey to estimate the 
design parameters, this lag introduces a level of 
sub-optimality into the redesign methodology. But as a 
practical matter, using the data from the most recent survey is 
the best that can be done. Once the design parameters have 
been estimated or are known, however, the cut-offs and 
sample sizes required to achieve the desired precision yield 
the lowest anticipated design variance given that the estimates 
are true (or close to it). It is therefore in this sense that the 
word “optimal” is used. 


2. SURVEY FRAMEWORK 


The model assisted survey framework is adopted for the 
skewed population whose auxiliary and survey characteristics 
are denoted by $C_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. In this framework, 
underlying the class of generalized regression estimators for 
the population total are regression models (Särndal 1992, 
p. 255) exploiting the correlation between the survey variables y 
and the auxiliary covariates x. Different model assumptions 
on the deterministic and stochastic components of the underlying 
model lead to different regression estimators for the population 
total. For example, a ratio-form heteroscedastic model 

$$y_k = \beta x_k + \varepsilon_k \qquad (2.1)$$

with the errors $\varepsilon_k \sim (0, \sigma_k^2)$ and the variance structure given 
by $\sigma_k^2 = c x_k^{\gamma}$ ($\gamma$ is the heteroscedasticity parameter) leads to 
the following GREG estimator: 


$$\hat t_{Rb} = \sum_{U_b} x_k \hat B + \sum_{s_b} \frac{y_k - x_k \hat B}{\pi_k} \qquad (2.2)$$

where $\hat B = (\sum_{s_b} y_k/\pi_k)/(\sum_{s_b} x_k/\pi_k)$ is the sample-based probability 
weighted estimate of the population regression parameter $B$. 
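
The estimator in (2.2) can be sketched in a few lines. This is an illustration under the ratio-form model only; the function name, the data layout (a list of `(x_k, y_k, pi_k)` triples for the sampled units), and the numbers are hypothetical.

```python
def greg_ratio_total(x_sampled_group, sample):
    """GREG estimate of the sampled-group total under the ratio model.

    x_sampled_group: x_k for every unit in the sampled group U_b.
    sample: (x_k, y_k, pi_k) for each selected unit in s_b.
    """
    b_hat = sum(y / p for _, y, p in sample) / sum(x / p for x, _, p in sample)
    synthetic = b_hat * sum(x_sampled_group)                     # sum over U_b of x_k * B-hat
    correction = sum((y - x * b_hat) / p for x, y, p in sample)  # weighted residuals
    return synthetic + correction, b_hat


# Hypothetical data in which y is exactly twice x.
total, b_hat = greg_ratio_total([1, 2, 3, 4, 5], [(2, 4, 0.4), (5, 10, 0.8)])
print(round(total, 6), round(b_hat, 6))
```

With this choice of $\hat B$ the weighted residuals sum to zero, so the estimate reduces to $\hat B$ times the known total of $x$ in the sampled group.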
Given this estimation framework, the total across both 
groups $t = t_a + t_b$ is estimated by $\hat t_R = t_a + \hat t_{Rb}$, where 
$t_a = \sum_{U_a} y_k$ since all units are sampled in the take-all group and 
$\hat t_{Rb}$ is the GREG estimator under the relevant model. The 
anticipated variance of $\hat t_R$ (defined as the variance with 
respect to both the design and the model, denoted $p$ and $\xi$, 
respectively) is expressible as 

$$V_{\xi p}(\hat t_R) = V_{\xi p}(\hat t_{Rb}) = \sum_{k \in U_b} \left(\frac{1}{\pi_k} - 1\right) \sigma_k^2. \qquad (2.3)$$


Furthermore, if $\sigma_k^2$ depends on the size measure $x_k$ 
according to the formulation $\sigma_k^2 = c x_k^{\gamma}$ (2.4), then design 
optimality (Godambe and Joshi 1965) implies that the optimal 
sample inclusion probabilities are $\pi_k(\gamma) \propto x_k^{\gamma/2}$, $k \in U_b$. 
Therefore, the sample design $p_b(s_b; \lambda = \gamma)$ in the sampled 
sub-population, defining the first order inclusion probabilities 
$\pi_k = n_b (x_k^{\gamma/2} / \sum_{U_b} x_k^{\gamma/2})$, $k \in U_b$, minimizes the anticipated 
variance $V_{\xi p}(\hat t_R)$. 
In the model assisted framework used in this paper, the 
auxiliary measure $x_k$ is assumed to be a scalar. As noted by a 
referee, the more general case where $x_k$ is a vector could be 
handled by fitting the appropriate parametric relationship 
$\sigma_k = f(x_{1k}, \ldots, x_{qk})$ and using the estimated $\hat\sigma_k$ in lieu of $x_k$ in 
defining the inclusion probabilities. The approach for the 
multivariate $x_k$ seems intuitively sound and is mentioned here 
for completeness but requires further study and investigation. 

Three methods for estimating the heteroscedasticity 
parameter $\gamma$ from past survey data called the “Least Squares 
Method”, the “Maximum Likelihood Method”, and the 
“Graphical Method” are described in Appendix A of Pandher 
(1995). 


3. TRANSFER ALGORITHM 


In this section, an iterative scheme named the “Transfer 
Algorithm” is proposed to determine the optimal demarcation 
between the take-all and sampled sub-populations under the 
sample design $p(s; \lambda)$. The criterion for this construction is 
based on finding a population partitioning minimizing the 
estimated anticipated variance of $\hat t_R$. An equivalence result 
from Pandher (1995) is used to find an alternative and simpler 
method of solution based entirely on quantities defined on the 
auxiliary population data. 

The proposed scheme for constructing the take-all and 
sampled sub-populations, $U_a$ and $U_b$, respectively, is based on 
the following idea. Initially, place all population units in the 
sampled group, labelling it $U_b^{(0)}$ (the superscript represents 
the iteration cycle $l$). Hence, the take-all group is an empty set 
$U_a^{(0)} = \{\emptyset\}$. The resulting population and sample size 
allocation at $l = 0$ is given by $N_a^{(0)} = 0$, $n_a^{(0)} = 0$, $N_b^{(0)} = N$, 
and $n_b^{(0)} = n$, where $n$ is the current sample size. 

In a repeat survey setting, the variances $\sigma_k^2$ in (2.3) can be 
empirically modelled using the relation $\sigma_k^2 = c x_k^{\gamma}$ (2.4) where 
$\gamma$ and $c$ are estimated from previous sample data as mentioned 
before. Using the estimated version of (2.4) in (2.3) yields the 
following estimator for $V_{\xi p}(\hat t_R)$: 

$$\hat V^{(l)}(\hat t_R; \lambda, N - l, n - l) = \sum_{k \in U_b^{(l)}} \left(\frac{1}{\pi_k(\lambda)} - 1\right) \hat c\, x_k^{\hat\gamma} \qquad (3.1)$$

where the largest $l$ x-valued units have been removed from 
the sampled group. Note that $\lambda$ is used here to parameterize the sample 
design to allow greater generality when $\lambda \ne \gamma$. 

In the iterative algorithm, we start initially with all 
population units placed in $U_b^{(0)}$. Then for each iteration 
$l$, $0 \le l < n$, the $(l+1)$-th largest x-valued unit $x_{(N-l)}$ is transferred 
from $U_b^{(l)}$ to $U_a^{(l+1)}$ and the difference 

$$\Delta(l) = \hat V^{(l+1)}(\hat t_R; \lambda, N - l - 1, n - l - 1) - \hat V^{(l)}(\hat t_R; \lambda, N - l, n - l) \qquad (3.2)$$

is computed. Negative values of $\Delta(l)$ mean that the transfer 
of the unit corresponding to the ordered value $x_{(N-l)}$ leads to 
a decrease in the variance. Moreover, such transfers continue 
to result in a reduction in the variance of $\hat t_R$ as long as 
$\Delta(l) < 0$. In general, for any iteration $l$, the relationship 
between the population and sample size allocations is 
described by the following relations: $N_b^{(l)} = N - l$, $n_b^{(l)} = n - l$, 
and $N_a^{(l)} = n_a^{(l)} = l$. These relations hold because the overall 
population and sample sizes must remain constant 
($N = N_a^{(l)} + N_b^{(l)}$ and $n = n_a^{(l)} + n_b^{(l)}$) for all iterations. 

The solution is also constrained by the condition 
$\pi_k(\lambda) \le 1$, $k \in U_b(l^*)$. Let $l^*(\lambda)$, $0 \le l^* < n$, represent the 
solution to the Transfer Algorithm. Given the discussion 
above, the solution to the Transfer Algorithm under the 
sample design $p(s; \lambda)$ may be formulated as 

$$l^*(\lambda) = \min\{l : [\pi_{(N-l)}(\lambda) \le 1] \text{ and } [\Delta(l) = \hat V^{(l+1)}(\hat t_R; \lambda) - \hat V^{(l)}(\hat t_R; \lambda)] \ge 0,\ 0 \le l < n\}. \qquad (3.3)$$

The optimal population allocation to the take-all group 
$U_a(l^*)$ is then given by the population units coinciding with 
the $l^*$ ordered units transferred to the take-all auxiliary vector 
$X_a^{(l^*)} = \{x_{(N-l^*+1)}, x_{(N-l^*+2)}, \ldots, x_{(N)}\}$. Correspondingly, the sampled 
group $U_b(l^*)$ consists of the units corresponding to 
$X_b^{(l^*)} = \{x_{(1)}, x_{(2)}, \ldots, x_{(N-l^*)}\}$. 

Transferring a unit from $U_b^{(l)}$ to $U_a^{(l+1)}$ causes two opposite 
effects on the variance $\hat V^{(l)}(\hat t_R; \cdot)$. The reduction in the 
population size ($N_b^{(l+1)} = N_b^{(l)} - 1$) has the impact of decreasing 
the variance, while the equivalent reduction in the sample size 
($n_b^{(l+1)} = n_b^{(l)} - 1$) has the reverse effect of increasing 
$\hat V^{(l)}(\hat t_R; \cdot)$. Somewhere in this process, a critical value 
$l^*$, $0 \le l^* < n$, exists which gives the optimal breakdown 
$\{U_a(l^*), U_b(l^*)\}$. Moreover, in Theorem 3 of Pandher 
(1995), it is shown that as long as the conditions 
$x_{(N-l)} - x_{(N-l-1)} \ge 0$ and $\pi_{(N-l)} - \pi_{(N-l-1)} \ge 0$, $0 \le l < n$, hold, 
a solution to the Transfer Algorithm exists and that the system 
remains stable (optimal) upon reaching $l^*$. Stability further 
implies that the solution is optimal since the conditions leading 
to the solution do not change in the range $l^* \le l < n$. These 
two properties may be more precisely defined as follows: 


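Existence: $\exists\, l^*$, $0 \le l^* < n$, such that $\hat V^{(l^*+1)} - \hat V^{(l^*)} \ge 0$ 
and $\pi_{(N-l^*)} \le 1$. 

Stability: If $\hat V^{(l^*+1)} - \hat V^{(l^*)} \ge 0$, then $\hat V^{(l+1)} - \hat V^{(l)} \ge 0$ 
and $\pi_{(N-l)} \le 1$ for $0 \le l^* \le l < n$. 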
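
The search defined by (3.3) can be sketched directly from the estimated variance. The code below is an illustration only: it assumes inclusion probabilities proportional to $x_k^{\lambda/2}$ in the sampled group, takes $\hat c = 1$ by default, and stops the search at $l = n - 2$ so that at least one unit remains to be sampled when $\hat V^{(l+1)}$ is evaluated. The function names and the toy population are hypothetical.

```python
def est_variance(xs, l, n, lam, gamma, c=1.0):
    """Estimated anticipated variance with the l largest units in the take-all group.

    xs must be sorted in ascending order; the sampled group is xs[:N-l]
    and its sample size is n - l.
    """
    xb = xs[: len(xs) - l]
    s = sum(v ** (lam / 2.0) for v in xb)
    total = 0.0
    for v in xb:
        pi_k = (n - l) * v ** (lam / 2.0) / s
        total += (1.0 / pi_k - 1.0) * c * v ** gamma
    return total


def transfer_algorithm(x, n, lam, gamma, c=1.0):
    """Smallest l with pi_(N-l) <= 1 and Delta(l) >= 0."""
    xs = sorted(x)
    for l in range(n - 1):  # Delta(l) needs n - l - 1 >= 1 sampled units
        xb = xs[: len(xs) - l]
        s = sum(v ** (lam / 2.0) for v in xb)
        pi_largest = (n - l) * xb[-1] ** (lam / 2.0) / s
        delta = (est_variance(xs, l + 1, n, lam, gamma, c)
                 - est_variance(xs, l, n, lam, gamma, c))
        if pi_largest <= 1.0 and delta >= 0.0:
            return l
    return None


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 400]  # hypothetical skewed sizes
print(transfer_algorithm(x, n=5, lam=2.0, gamma=2.0))  # prints 2
```

In this toy population the two very large units are moved to the take-all group: with either of them still in the sampled group, the largest inclusion probability exceeds 1.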


An example of the application of the Transfer Algorithm to 
the LGF survey population of local municipalities in Ontario 
(with $N = 793$, $n = 108$, $\gamma = 2$, and $\lambda = 1$) is given in Figure 1. 
The curves are plotted for $l > 8$ because in the interval 
$0 \le l \le 8$, the first condition of (3.3), namely $[\pi_{(N-l)}(\lambda) \le 1]$, 
is not satisfied. The minimum value of $\hat V(\hat t_R)$ is achieved 
at $l^* = 57$ where $\Delta(l^*) = \hat V^{(l^*+1)} - \hat V^{(l^*)} \ge 0$. 


Figure 1. Changes in variance of regression estimator 
($\lambda = 1$): $\Delta(l) = \hat V^{(l+1)}(\hat t_R; \lambda, N - l - 1, n - l - 1) - \hat V^{(l)}(\hat t_R; \lambda, N - l, n - l)$ 
(vertical axis: $\Delta(l)$, in units of $10^9$) 

Theorem 2 from the complete paper is an important result 
which allows the solution to the Transfer Algorithm to be 
equivalently expressed in terms of simpler quantities based on 
the auxiliary data. A brief sketch of the development of this 
theorem is given in the Appendix. 


Theorem 2. Equivalent Solution to the Transfer 
Algorithm 


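The solution $l^*(\lambda)$ to the Transfer Algorithm stated in (3.3) 
in terms of $\hat V^{(l+1)} - \hat V^{(l)}$ and $\pi_{(N-l)}(\lambda)$ may also be equivalently 
expressed as 

$$l^*(\lambda) = \begin{cases} \min\{l : n - l \le R(l; \gamma - \lambda/2),\ 0 \le l < n\}, & 0 \le \lambda < \gamma \\ \min\{l : n - l \le R(l; \gamma/2),\ 0 \le l < n\}, & \lambda = \gamma \\ \min\{l : n - l \le R(l; \lambda/2),\ 0 \le l < n\}, & \gamma < \lambda \end{cases}$$

where $R(l; \gamma - \lambda/2) = \sum_{k=1}^{N-l} x_{(k)}^{\gamma - \lambda/2} / x_{(N-l)}^{\gamma - \lambda/2}$ and $R(l; \lambda/2) = 
\sum_{k=1}^{N-l} x_{(k)}^{\lambda/2} / x_{(N-l)}^{\lambda/2}$ define the critical values. 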

The use of this theorem to find the optimal population 
allocation is illustrated graphically in Figure 2 (Ontario data). 
In this case, $0 \le \lambda < \gamma$, and the solution is determined by the 
behaviour of the functions $R(l; \gamma - \lambda/2)$ (the lower curve in 
the graph) and $n - l$. The same solution $l^* = 57$ is obtained as 
before. 
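
The equivalent solution can be computed from the auxiliary data alone. The sketch below takes the smaller of the two critical values at each step, which covers the three cases of the theorem in one expression because $R(l; \alpha)$ decreases as $\alpha$ increases. The function names and the toy population are hypothetical.

```python
def critical_value(xs, l, alpha):
    """R(l; alpha) = sum_{k <= N-l} x_(k)**alpha / x_(N-l)**alpha for ascending xs."""
    xb = xs[: len(xs) - l]
    return sum(v ** alpha for v in xb) / xb[-1] ** alpha


def transfer_solution_equiv(x, n, lam, gamma):
    """Smallest l with n - l <= min{R(l; lam/2), R(l; gamma - lam/2)}."""
    xs = sorted(x)
    for l in range(n):
        bound = min(critical_value(xs, l, lam / 2.0),
                    critical_value(xs, l, gamma - lam / 2.0))
        if n - l <= bound:
            return l
    return None


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 400]  # hypothetical skewed sizes
print(transfer_solution_equiv(x, n=5, lam=2.0, gamma=2.0))  # prints 2
print(transfer_solution_equiv(x, n=5, lam=1.0, gamma=2.0))  # prints 2
```

No variance needs to be evaluated here: each step only compares the remaining sample size $n - l$ with a ratio of sums of powers of the ordered size values.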


Figure 2. Use of $R(l; \gamma - \lambda/2)$, $R(l; \lambda/2)$, and $(n - l)$ to construct 
optimal take-all/sampled groups (Ontario); horizontal axis: $l$ (units transferred) 


4. SAMPLE SIZE DETERMINATION AND COMBINED ITERATIVE PROCEDURE 


Given a sample design $p(s; \lambda)$, $0 \le \lambda \le 2\gamma$, with sample size 
$n$, the Transfer Algorithm yields an optimal construction of the 
take-all and sampled sub-populations, $U_a(l^*)$ and $U_b(l^*)$, 
respectively. Next, an expression for finding the minimal 
sample size is obtained which meets the imposed precision 
constraint, expressed in terms of the coefficient of variation 
$CV_{\min}$. The sample determination step is then integrated with 
the Transfer Algorithm to develop a combined procedure 
which allows the survey designer to find the globally minimal 
sample size and optimal population partitioning. 


4.1 Expression for New Sample Size 


Let $q$ represent the iteration cycle for the combined procedure 
and $n_q^* = n_{a,q}^* + n_{b,q}^*$ denote the total minimal sample size 
required to satisfy the precision constraint. Given the sample 
design $p_b(s_b; \lambda, l_q^*, n_q)$, current sample size $n_q$, and the 
population partitioning $\{U_{a,q}(l_q^*), U_{b,q}(l_q^*)\}$, the precision 
constraint for $\hat t_R = t_a + \hat t_{Rb}$ may be stated formally as 

$$\frac{\sqrt{\hat V(\hat t_R; \lambda, N - l_q^*, n_{b,q}^*)}}{\hat t_R} \le CV_{\min}. \qquad (4.1)$$


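Solving this inequality for $n_{b,q}^*$ gives the following expression 
for the minimal sample size needed in the sampled group 
$U_{b,q}(l_q^*)$ to meet the precision constraint: 

$$n_{b,q}^*(l_q^*, n_q) = \frac{\hat c\, X(l_q^*, \lambda/2)\, X(l_q^*, \hat\gamma - \lambda/2)}{\hat t_R^2\, CV_{\min}^2 + \hat c\, X(l_q^*, \hat\gamma)} \qquad (4.2)$$

where $X(l_q^*, \lambda/2) = \sum_{k=1}^{N - l_q^*} x_{(k)}^{\lambda/2}$, $X(l_q^*, \hat\gamma - \lambda/2) = \sum_{k=1}^{N - l_q^*} x_{(k)}^{\hat\gamma - \lambda/2}$, 
and $\hat t_R$ may be estimated from past survey data corresponding 
to the period of the auxiliary information. The total new 
minimal sample size required to meet the precision constraint 
is then given by 

$$n_q^* = n_{a,q}^* + n_{b,q}^* = l_q^* + n_{b,q}^*(l_q^*, n_q). \qquad (4.3)$$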
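
A short sketch of the sample size expression follows. It assumes the form in which the estimated variance is $\hat c\,[X(l, \lambda/2) X(l, \gamma - \lambda/2)/n_b - X(l, \gamma)]$, so that solving the precision constraint for $n_b$ gives the ratio below; rounding up to the next integer is an added assumption, and the function name and inputs are hypothetical.

```python
import math


def minimal_sampled_size(xs, l, lam, gamma, c, t_hat, cv_min):
    """Smallest integer n_b meeting the CV constraint in the sampled group xs[:N-l].

    xs must be sorted in ascending order.
    """
    xb = xs[: len(xs) - l]

    def power_sum(alpha):
        return sum(v ** alpha for v in xb)

    n_b = (c * power_sum(lam / 2.0) * power_sum(gamma - lam / 2.0)
           / ((cv_min * t_hat) ** 2 + c * power_sum(gamma)))
    return math.ceil(n_b)


xs = list(range(1, 11))  # hypothetical sampled group, already sorted
print(minimal_sampled_size(xs, l=0, lam=2.0, gamma=2.0, c=1.0, t_hat=100.0, cv_min=0.1))  # prints 7
```

For these inputs the unrounded value is about 6.24. A sample of 7 gives a CV of roughly 0.069, while a sample of 6 would give roughly 0.109 and miss the 0.1 target.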


4.2 Combined Sample Redesign Methodology 


Next, note that the solution to the Transfer Algorithm $l_q^*$ 
depends on the current total sample size: $l_q^*(\lambda) = l_q^*(\lambda, n_q)$. 
Once the new minimal sample size $n_q^*$ is determined, the 
existing partitioning $\{U_{a,q}(l_q^*), U_{b,q}(l_q^*)\}$ which was optimal at 
$n_q$ is no longer optimal at the new minimal sample size $n_q^*$ 
(i.e., $l^*(\lambda, n_q) \ne l^*(\lambda, n_q^*)$ if $n_q \ne n_q^*$). Therefore, letting 
$n_{q+1} = n_q^*$, a new population partitioning from the Transfer 
Algorithm based on $l_{q+1}^*(\lambda, n_{q+1})$, given by 
$\{U_{a,q+1}(l_{q+1}^*), U_{b,q+1}(l_{q+1}^*)\}$, is required to optimize the construction of 
the take-all and sampled sub-populations. Next, applying 
(4.2) over $U_{b,q+1}(l_{q+1}^*)$ gives a new minimal sample size 
$n_{q+1}^* = l_{q+1}^*(\lambda, n_{q+1}) + n_{b,q+1}^*$ required to achieve the desired 
precision $CV_{\min}$. Proceeding in this fashion, the combined 
scheme produces a sequence of population partitionings, 
sample sizes, and sample allocations 

$$\{\, l_q^*(\lambda, n_q),\ (n_{a,q} = l_q^*,\ n_{b,q} = n_q - l_q^*),\ (N_{a,q} = l_q^*,\ N_{b,q} = N - l_q^*),\ (n_{a,q}^* = l_q^*,\ n_{b,q}^*) \,\}, \quad q = 0, 1, \ldots \qquad (4.4)$$

with $n_{q+1} = n_q^* = n_{a,q}^* + n_{b,q}^*$ and the initial value $n_0$ (current 
survey sample size). The combined procedure is repeated 
until further reductions in the minimal sample size can no 
longer be achieved. This leads to the stopping rule 

$$q^* = \min\{q : n_{q+1} - n_q \ge 0\}. \qquad (4.5)$$
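
The combined procedure can be sketched as a loop that alternates the two steps until the stopping rule is met. This is an illustration under stated assumptions, not the production implementation: it uses the equivalent form of the Transfer Algorithm, rounds the sampled-group size up to an integer, holds $\hat t_R$ fixed across iterations, and runs on a hypothetical toy population with hypothetical parameter values.

```python
import math


def transfer_solution(xs, n, lam, gamma):
    """Smallest l with n - l <= min{R(l; lam/2), R(l; gamma - lam/2)}; xs ascending."""
    for l in range(n):
        xb = xs[: len(xs) - l]
        bound = min(sum(v ** a for v in xb) / xb[-1] ** a
                    for a in (lam / 2.0, gamma - lam / 2.0))
        if n - l <= bound:
            return l
    return n - 1  # not reached when n <= N, since R(l; alpha) >= 1


def sampled_size(xs, l, lam, gamma, c, t_hat, cv_min):
    """Minimal sampled-group size from the precision constraint, rounded up."""
    xb = xs[: len(xs) - l]

    def power_sum(alpha):
        return sum(v ** alpha for v in xb)

    n_b = (c * power_sum(lam / 2.0) * power_sum(gamma - lam / 2.0)
           / ((cv_min * t_hat) ** 2 + c * power_sum(gamma)))
    return math.ceil(n_b)


def combined_redesign(x, n0, lam, gamma, c, t_hat, cv_min, max_iter=100):
    """Rows (q, n_q, l_q, n_a, n_b, n_star) until n_star stops decreasing."""
    xs = sorted(x)
    n_q, rows = n0, []
    for q in range(max_iter):
        l_q = transfer_solution(xs, n_q, lam, gamma)
        nb_star = sampled_size(xs, l_q, lam, gamma, c, t_hat, cv_min)
        n_star = l_q + nb_star
        rows.append((q, n_q, l_q, l_q, nb_star, n_star))
        if n_star - n_q >= 0:   # stopping rule: no further reduction
            break
        n_q = n_star            # n_{q+1} = n_q^*
    return rows


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 400]
for row in combined_redesign(x, n0=8, lam=2.0, gamma=2.0, c=1.0, t_hat=1000.0, cv_min=0.02):
    print(row)
```

On this toy population the total sample size falls from 8 to 6 at the first iteration, and the second iteration confirms that no further reduction is available. The take-all cut-off also changes between iterations, which is why the partitioning has to be recomputed each time.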

The optimality of the combined procedure can be 
established using Theorem 2 and is omitted here due to space 
(see Pandher 1995). The main result is that the combined 
procedure converges to a globally optimal solution along the 
path defined by (4.4) to a point where further reductions in 
the sample size are not possible (by reconstructing $U_a$ and 
$U_b$) given the imposed precision constraint. 


5. APPLICATION 


The combined sample design procedure described above 
is now applied to the redesign of the Local Government 
Finance Survey in the province of Ontario. The survey 
response y in this application is the actual expenditures 
reported for sampled local government units for Ontario in 
1989. The actual estimates are prepared 30 months after the 
end of the survey year from financial statements submitted by 
the local government units to the Department of Municipal 
Affairs (provincial). Population counts for the local 
government units from the nearest census (1991) are used as 
the auxiliary variable x. The population of local-level 
municipalities for Ontario consists of a total of 793 units of 
which a sample of 108 units is currently taken. 

The results of applying the combined methodology to 
Ontario LGFS data are reported in Table 1. The level of 
desired precision $CV_{\min}$ was set at 2% for the total regression 
estimator $\hat t_R = t_a + \hat t_{Rb}$. Using the methods of Pandher (1995), 
the best estimate of the heteroscedasticity parameter $\gamma$ in 
Ontario was determined to be $\hat\gamma = 2$; the corresponding 
proportionality constant was estimated to be $\hat c = .0825$. The near 
optimal sample design defined by $\lambda = \hat\gamma$ ($p(s; \hat\gamma)$) was used. 


Table 1 
Application of Combined Methodology to LGF Survey Data 
(Ontario, 1989) 

Iteration (q)    n_q    l*_q(λ, n_q)    n*_{a,q}    n*_{b,q}    n*_q 
      0          108         39            39          18        57 
      1           57         16            16          34        50 
      2           50         12            12          38        50 

For Ontario the combined scheme stopped at iteration 
$q = 2$. The globally optimal population partitioning between 
the take-all and sampled groups is $N_a = 16$ and $N_b = 777$. 
The new minimal total sample size is $n^* = 50$ with allocations 
$n_a^* = 16$ and $n_b^* = 34$. A total sample size reduction of 
$n_0 - n^* = 108 - 50 = 58$ is achieved at the desired CV of 2% 
for the regression estimator $\hat t_R = t_a + \hat t_{Rb}$. 


6. CONCLUDING REMARKS 


This paper provides a comprehensive methodology for 
identifying and implementing an efficient sample design for 
recurrent surveys of skewed populations. The combined 
procedure integrates the solution to the following three 
problems: i) identifying an efficient sample selection scheme, 
ii) constructing an efficient demarcation between the take- 
all and sampled population groups at a given sample size, and 
iii) determining the minimal sample size required to meet the 
precision constraint(s). 

The equivalence result to the Transfer Algorithm (Pandher 
1995) was used to create the take-all and sampled groups. The 
first two components were then combined with a sample size 
determination step through an iterative procedure. Under the 
stopping rule, the combined iterative procedure converges to a 
globally minimal sample size and optimal population 
partitioning. Results from the application of the proposed 
sample redesign methodology to the Local Government 
Survey in Ontario were reported. A 52% reduction in the total 
sample size was achieved for the regression estimator of the 
total ($\hat t_R = t_a + \hat t_{Rb}$) at the desired precision of CV = 2%. 

APPENDIX 


A brief sketch of the development behind Theorem 2 
(Equivalence Result) is given here; for technical details see 
Pandher (1995). The same paper also establishes the desirable 
mathematical properties of the Transfer Algorithm such as 
existence and optimality of solution as well as the optimality 
of the combined procedure. 

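Using the expression for the variance $\hat V^{(l)}(\hat t_R; \cdot)$ given 
in (3.1), the difference $\hat V^{(l+1)} - \hat V^{(l)}$ may be expressed as 

$$\hat V^{(l+1)} - \hat V^{(l)} = \hat c\, \frac{A(l)\, B(l)}{(n - l)(n - l - 1)} \qquad (A.1)$$

where 

$$A(l) = \sum_{k=1}^{N-l} x_{(k)}^{\lambda/2} - (n - l)\, x_{(N-l)}^{\lambda/2}$$

and 

$$B(l) = \sum_{k=1}^{N-l} x_{(k)}^{\gamma - \lambda/2} - (n - l)\, x_{(N-l)}^{\gamma - \lambda/2}.$$

The condition $B(l) < 0$ may also be expressed as 
$n - l > R(l; \gamma - \lambda/2)$ where $R(l; \alpha) = \sum_{k=1}^{N-l} x_{(k)}^{\alpha} / x_{(N-l)}^{\alpha}$. Similarly, 
the condition $A(l) \ge 0$ corresponds to $n - l \le R(l; \lambda/2)$. All 
possible states of the system defined by the Transfer 
Algorithm are summarized in Table A.1. 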
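
The product form of the variance difference can be verified in a few lines. The sketch below assumes inclusion probabilities $\pi_k(\lambda) = (n - l)\, x_k^{\lambda/2} / \sum_{j=1}^{N-l} x_{(j)}^{\lambda/2}$ in the sampled group; the shorthand $S_1$, $S_2$, $S_\gamma$, $a$, $b$, and $m$ is introduced here for the derivation only.

```latex
% Shorthand (introduced for this sketch):
%   S_1 = \sum_{k=1}^{N-l} x_{(k)}^{\lambda/2},  S_2 = \sum_{k=1}^{N-l} x_{(k)}^{\gamma-\lambda/2},
%   S_\gamma = \sum_{k=1}^{N-l} x_{(k)}^{\gamma},
%   a = x_{(N-l)}^{\lambda/2},  b = x_{(N-l)}^{\gamma-\lambda/2},  m = n - l.
\begin{align*}
\hat V^{(l)}   &= \hat c \left( \frac{S_1 S_2}{m} - S_\gamma \right), \\
\hat V^{(l+1)} &= \hat c \left( \frac{(S_1 - a)(S_2 - b)}{m - 1} - (S_\gamma - ab) \right), \\
m(m-1)\,\frac{\hat V^{(l+1)} - \hat V^{(l)}}{\hat c}
  &= m (S_1 - a)(S_2 - b) - (m - 1) S_1 S_2 + m(m-1)\, ab \\
  &= S_1 S_2 - m a S_2 - m b S_1 + m^2 ab \\
  &= (S_1 - m a)(S_2 - m b) = A(l)\, B(l).
\end{align*}
```

The sign of the variance difference is therefore the sign of the product $A(l) B(l)$, which is what allows the states of the algorithm to be tabulated in terms of $A(l)$ and $B(l)$ alone.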

Table A.1 
Outcomes for $\hat V^{(l+1)} - \hat V^{(l)} < 0$ and $\hat V^{(l+1)} - \hat V^{(l)} \ge 0$ 
in Terms of $n_b^{(l)} = n - l$ 

Behaviour of A and B             Condition on $n_b^{(l)} = n - l$                                   State 

$\hat V^{(l+1)} - \hat V^{(l)} < 0$ 
$A(l) \ge 0$, $B(l) < 0$         $R(l; \gamma - \lambda/2) < n - l \le R(l; \lambda/2)$             (T.1) 
$A(l) < 0$, $B(l) \ge 0$         $R(l; \lambda/2) < n - l \le R(l; \gamma - \lambda/2)$             (T.3) 

$\hat V^{(l+1)} - \hat V^{(l)} \ge 0$ 
$A(l) \ge 0$, $B(l) \ge 0$       $n - l \le \min\{R(l; \lambda/2), R(l; \gamma - \lambda/2)\}$      (T.2) 
$A(l) < 0$, $B(l) < 0$           $n - l > \max\{R(l; \lambda/2), R(l; \gamma - \lambda/2)\}$        (T.4) 


The first column describes the behaviour of $A(l)$ and $B(l)$ 
leading to the outcome $\hat V^{(l+1)} - \hat V^{(l)} < 0$ and $\hat V^{(l+1)} - \hat V^{(l)} \ge 0$, 
respectively. The second column describes the equivalent 
condition in terms of $n_b^{(l)} = n - l$, $R(l; \gamma - \lambda/2)$, and $R(l; \lambda/2)$ 
corresponding to $\hat V^{(l+1)} - \hat V^{(l)} < 0$ and $\hat V^{(l+1)} - \hat V^{(l)} \ge 0$, 
respectively. An important condition required for the solution 
to the Transfer Algorithm $l^*(\lambda)$ is that $\pi_{(N-l)}(\lambda) \le 1$ hold. It 
is easy to verify that $\pi_{(N-l)}(\lambda) \le 1 \Leftrightarrow A(l) \ge 0$. In terms of the 
description for the Transfer Algorithm given in Table A.1, 
this condition means that the solution can occur only when 
both $A(l) \ge 0$ and $B(l) \ge 0$ or, equivalently, when $n - l$ 
satisfies condition (T.2). 

Table A.1 completely enumerates all possible states of the 
system defined by the Transfer Algorithm. The correspondence 
between the internal cell quantities (computable 
directly from the auxiliary data and estimated parameters) and 
the margins ($A(l)$, $B(l)$, $\hat V^{(l+1)} - \hat V^{(l)}$) represents a tautology 
which leads directly to Theorem 2 (Equivalence Result). The 
behaviour of the system described in the table also depends 
on the sample design $p(s; \lambda)$ employed. The three relevant 
cases are: 

a) $0 \le \lambda < \gamma \Rightarrow [R(l; \gamma - \lambda/2) < R(l; \lambda/2)]$, 
b) $\lambda = \gamma \Rightarrow [R(l; \gamma - \lambda/2) = R(l; \lambda/2)]$, and 
c) $\gamma < \lambda \Rightarrow [R(l; \gamma - \lambda/2) > R(l; \lambda/2)]$. 


In case a) the system starts ($l = 0$) in state (T.4), moves to 
(T.1) and then finally rests in state (T.2); state (T.3) is 
infeasible here. The solution to the Transfer Algorithm $l^*(\lambda)$ 
is given by the smallest $l$ leading the system to move into state 
(T.2). In case b), the system starts in state (T.4) and moves to 
(T.2); (T.1) and (T.3) do not apply. Finally, in case c), the 
transition path is from (T.4) to (T.3) to (T.2); here (T.1) is 
invalid. 